How to remove weird characters from file and directory names automatically

Marco Fioretti wrote a simple script to solve a pesky problem automatically. He shares his method of cleaning up bad folder and file names.

Personally, I have always tried to avoid files with non-alphanumeric names, because coping with them makes lots of my scripts harder to write than they could be.

Weird file and folder names cause me lots of trouble whenever I support my "computer-challenged" friends who run Linux. They like Free Software, but name their folders and files in the craziest and most wildly inconsistent ways. Besides making command line tasks unnecessarily difficult, this habit has sometimes caused annoying and hard-to-diagnose problems. I've seen OpenOffice on Ubuntu fail to open files that had been manually saved as "myfile.doc " (note the trailing space), and people get angry because "it won't open that darn .doc file!!!"

It was to cope with such situations that I wrote the script I present this week. It automatically sanitizes the names of all the sub-folders and files in a given directory, even if they have names like:

Photos: Canadair CL-215-6B11 CL-415 Aircraft Pictures | Airliners/civilian aircraft, /

Holiday:Me & Mike's family! , Phoenix ,/ 30-10:Panorama.jpg

The code is just twenty lines long:

       1  #! /bin/bash
       2
       3  rm -f /tmp/clean_dir_file_names*
       4  cd "$1" || exit 1
       5  find .  | awk '{ print length(), $0 | "sort -n -r" }' | \
       6          grep -v '^1 \.$' | cut -d/ -f2- > /tmp/clean_dir_file_names_1
       7
       8  touch /tmp/clean_dir_file_names_2
       9  while IFS= read -r line
      10    do
      11    BASE=$(basename "$line")
      12    NEWBASE=$(basename "$line" | perl -e '$N = <>; chomp ($N); $N =~ s/[^a-zA-Z0-9_.-]/_/g; $N =~ s/_+/_/g; $N= lc($N); $N =~ s/_([a-z])/_.uc($1)/eg; print ucfirst($N);')
      13    if [ "$BASE" != "$NEWBASE" ]
      14    then
      15    OLDPATH=$(echo "$line" | sed -r 's/([^a-zA-Z0-9./_-])/\\\1/g')
      16    DIR=$(dirname "$line" | sed -r 's/([^a-zA-Z0-9./_-])/\\\1/g')
      17    echo "mv -i $OLDPATH $DIR/$NEWBASE" >> /tmp/clean_dir_file_names_2
      18    fi
      19  done </tmp/clean_dir_file_names_1
      20  exit

The basic processing flow is simple: find all the files and folders in the starting directory and give them, one at a time, a new name without any weird characters. However, files and folders must be processed in the right order. If you consider a file with a weird name, inside several levels of weirdly named sub-folders:

pictures/ holidays!/2008, Dec. 1st: cruise/Hey, what's up?.jpg

you'll realize that the script must work bottom-up: for each branch of the folder tree, first rename everything at the deepest level, then move up one level and repeat. In the case shown above, this means doing something like:

mv pictures/ holidays!/2008, Dec. 1st: cruise/Hey, what's up?.jpg pictures/ holidays!/2008, Dec. 1st: cruise/Hey_what_s_up.jpg

mv pictures/ holidays!/2008, Dec. 1st: cruise pictures/ holidays!/2008_Dec_1st_cruise

mv pictures/ holidays! pictures/holidays

Working in any other order would cause the script to fail, unless it kept track of what names it had already changed or scanned all the folders again after every operation. Luckily, the solution is really simple to implement: the script must sort all the file and folder names by... string length, from longest to shortest. Since the path of anything inside a folder is always longer than the path of the folder itself, this is enough to guarantee that the script changes the name of any folder only after it has changed those of all the files and sub-folders inside it. The sorting happens in line 5:

  • find lists all the files and folders
  • awk prints their names, each preceded by its length
  • sort -n -r sorts all the resulting lines of text in reverse numeric order.

The grep and cut invocations in line 6 remove the line for the top directory itself ("1 .") and strip the string length and the leading "./" from every other line. The result, saved in the temporary file /tmp/clean_dir_file_names_1, will have this format:

  test/Gallery, personal/Recipes/basmati and shrimps, rice.jpeg
  test/Gallery, personal/airport Canadair, take-off.jpeg
  ...
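
To watch the ordering at work without touching real data, here is a minimal sketch that runs the same pipeline as lines 5 and 6 on a throwaway tree. The function name list_longest_first and the sample folder names are made up for illustration, and everything is created under a temporary directory obtained from mktemp.

```shell
#!/bin/bash
# Hypothetical helper: same pipeline as lines 5-6 of the script, but it
# prints to stdout instead of writing /tmp/clean_dir_file_names_1.
list_longest_first() {
    (
        cd "$1" || exit 1
        find . | awk '{ print length(), $0 | "sort -n -r" }' |
            grep -v '^1 \.$' | cut -d/ -f2-
    )
}

# Throwaway tree with made-up names.
demo=$(mktemp -d)
mkdir -p "$demo/my holidays!/2008, Dec"
touch "$demo/my holidays!/2008, Dec/Hey, you.jpg"

list_longest_first "$demo"
# Prints:
#   my holidays!/2008, Dec/Hey, you.jpg
#   my holidays!/2008, Dec
#   my holidays!

rm -rf "$demo"
```

The file comes first, then the folder that holds it, then the folder above that, which is exactly the order in which they can be renamed safely.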

The loop starting on line 9 reads this file one line at a time, generates all the renaming instructions and saves them into /tmp/clean_dir_file_names_2. It uses the basename and dirname commands, which do just what their names say. If you have a file in /some/folder/somewhere/myfile, basename will return "myfile" and dirname "/some/folder/somewhere".
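
A quick way to check what the two commands return is a sketch like the following. The helper name split_path and the second and third sample paths are invented for the example; note the double quotes around the argument, without which names containing spaces would be split into several words.

```shell
#!/bin/bash
# Hypothetical helper: print the two halves of a path, as the loop does
# with basename and dirname.
split_path() {
    printf 'dir=%s base=%s\n' "$(dirname "$1")" "$(basename "$1")"
}

split_path "/some/folder/somewhere/myfile"
# dir=/some/folder/somewhere base=myfile
split_path "test/Gallery, personal/my photo.jpeg"
# dir=test/Gallery, personal base=my photo.jpeg
split_path "top-level file"
# dir=. base=top-level file
```

The last case explains why some of the generated commands end in ./SomeName: for an entry directly inside the starting directory, dirname returns a single dot.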

Line 12, which calculates the new name of the current file or folder with a Perl one-liner, is the only part of this script that you should change to suit your taste. My version performs, in sequence, these operations:

  • remove trailing newline (chomp)
  • replace all characters except letters, digits, dash, underscore and period with underscores
  • replace consecutive underscores with only one of them
  • change everything to lower case (lc($N))
  • use the uc and ucfirst Perl functions to capitalize all the words in the file name: "this_is_ME_iN_1998.JPG" becomes "This_Is_Me_In_1998.jpg"

If the new base name is different from the original one (line 13), then we have to generate a renaming command. However, precisely because we're dealing with non-alphanumeric characters, we have to escape all of them. This is exactly what happens in lines 15 and 16: using extended regular expressions (-r), sed finds all the weird characters and puts a backslash in front of each one. Running the script, you'll get a file full of valid shell commands of this kind:

  mv -i test/Gallery\,\ personal/Recipes/basmati\ and\ shrimps\,\ rice.jpeg test/Gallery\,\ personal/Recipes/Basmati_And_Shrimps_Rice.jpeg
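
The escaping step can be tried on its own with a sketch like this one. The function name escape_path is invented for the example, and it uses sed -E rather than -r (GNU sed accepts both spellings). The last line shows a different technique that the script does not use: the %q format of bash's builtin printf, which also produces a shell-safe version of a string.

```shell
#!/bin/bash
# Hypothetical helper: the same substitution as lines 15 and 16.
escape_path() {
    printf '%s\n' "$1" | sed -E 's/([^a-zA-Z0-9./_-])/\\\1/g'
}

escape_path "test/Gallery, personal/Recipes/basmati and shrimps, rice.jpeg"
# test/Gallery\,\ personal/Recipes/basmati\ and\ shrimps\,\ rice.jpeg

# Alternative (an assumption, not what the script does): let bash do
# the quoting itself.
printf '%q\n' "Hey, what's up?.jpg"
```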

Typing source /tmp/clean_dir_file_names_2 at the prompt, from inside the directory you passed to the script (the paths in the file are relative to it), will rename everything at once. Yes, of course you could put that command in the script itself, but isn't it better to have a look at which commands were generated before executing them?
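
One possible way to combine the review and the execution is sketched below. The function name apply_renames is made up, and the confirmation prompt is an addition of this example, not part of the script. The commands run in a subshell, so the cd does not change the directory of the shell you are typing in.

```shell
#!/bin/bash
# Hypothetical helper: show the generated commands, ask for confirmation,
# then run them from inside the directory that was given to the script.
apply_renames() {
    local target=$1
    local cmdfile=${2:-/tmp/clean_dir_file_names_2}
    local answer
    cat "$cmdfile"                                   # review what would run
    read -r -p "Run these commands in $target? [y/N] " answer
    [ "$answer" = "y" ] || return 1
    ( cd "$target" && . "$cmdfile" )                 # paths are relative to $target
}

# Example call (the path is only an illustration):
#   apply_renames /home/masarin/weirdfiles
```

Because the generated commands use mv -i, mv will still stop and ask before overwriting a file if two weird names happen to collapse into the same clean one.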

The main, if not the only, limitation I find in this script is that it replaces any letter not in the English alphabet. When dealing with names made of Italian words, this means "losing" the accented vowels, but that is such a rare occurrence that none of my users have complained. With other languages this may be a bigger issue (if one wanted to preserve those characters, that is), but I have never needed to deal with it. If you have, please let me know how!

About

Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.

11 comments
dold

I tried detox, and I tried this script. (Am I confused, or is it harder than it ought to be to copy-paste? This isn't BASIC, so why are there line numbers?) Someone sent me a tar.Z file that contained some bad files. ls -b output here:

  \177
  \177\177\177\030\030\030\030

ls -l:

  1057 Oct 8 16:30 ?
  1058 Oct 8 16:03 ???????

The script posted here just leaves them as is. detox gives some odd graphics and says the files already exist:

  Cannot rename lib/??? to lib/: file already exists
  Cannot rename lib/????????????????????? to lib/: file already exists

I manually moved them with

  mv ? one_bad_character
  mv ??????[!c] seven_bad_characters

(because there happened to be another file with seven characters that ended in c)

-- clarence

pgit

If you mounted a drive out of someone's Windows machine on your Linux system, would this script work to clean up file names on the NTFS partition? The worst naming conventions I see are done by Windows users.

mfioretti

When I first wrote this script, it was because I had never heard of detox (in spite of the several searches I made for command line utilities that clean up names). When I discovered it, I kept using my script, because it's more portable and flexible. It uses utilities surely available on every Linux distribution; it also fixes directory names, something that (at least, according to its own man page) detox doesn't do. Finally, if you know just a little bit of Perl, you can rename files in any way you can express with a regular expression (for example, this_is_ME_iN_1998.JPG to This_Is_Me_In_1998.jpg), which you can't do with detox. Thanks for mentioning it, though! Marco

cflange

You can also use 'detox *' or 'detox -r *' for recursive fixing. Unfortunately it also translates the accents from other languages (which may require option '-s iso8859_1' or '-s utf-8' to work correctly). Always test first with a dry-run (option -n).

mfioretti

... it makes it easier for me to point to several parts of the script, but you must remove them if you cut and paste the script code above into a file! Sorry that this wasn't clear.

mfioretti

pgit, as far as I can tell, the script should work on every filesystem that Linux can read and write; it shouldn't matter which filesystem it is, because it only changes _names_, which are supported on any filesystem, not permissions, links or other less universal stuff. That said, I have not had the chance to try it on all the filesystems around.

dold

detox does directories at least one level deep.

  $ detox --dry-run B*
  Bad Name@spaces -> Bad_Name_spaces
  Bad Name@spaces/What*kind!of stupid@name-is-this -> Bad Name@spaces/What_kind_of_stupid_name-is-this

I hadn't heard of detox before seeing this posting, but I tried it on cygwin and linux. A little annoying to have to load lex and yacc, two of my least favorite topics, and the lowest grade I received in college ;-( but it handles a directory at the top level with a bad name.

  $ your_script 'Bad Name@spaces'
  line 7: cd: Bad: No such file or directory

I had to back up to a good directory for this script to rename the "Bad" directory, as well as the underlying file that detox cleaned.

masarin

Hi, I copied the code above, removed the line numbers and saved the file with the .sh extension. I tried using it but can't get it to change any file names. Could someone tell me how to run the script?

mfioretti

clean_dir_file_names_2 is a list of commands. The way to execute them all at once is to type "source /tmp/clean_dir_file_names_2" at the command prompt (or whatever else is the absolute path to that file).

masarin

"have you realized that the code in this page doesn't change file names but only generates another file with the commands to change those names, and that is that second file that you must run?" Ok I got it now :). I did not realize this. Another q. do you run the commands in the second file (clean_tmp_files_2) one by one in the konsole window, or can you run them automatically?

mfioretti

Masarin, have you realized that the code on this page doesn't change file names, but only generates another file with the commands to change those names, and that it is that second file that you must run?

That said, in general these are scripts that you must run from a Linux command line, that is, from the textual interface of programs like Konsole or GNOME Terminal. You must run the script with the code on this page by giving it the name of the top folder that contains the files and other folders you want to sanitize. For example, if that folder is /home/masarin/weirdfiles and you saved the code in a file called scriptname.sh, you should type a command like this at the prompt:

  #> scriptname.sh /home/masarin/weirdfiles

You should also make the script executable first, with this other command:

  #> chmod 755 scriptname.sh

This will generate the file called clean_dir_file_names_2 in /tmp, and that is the file you must run to change the names. If you have already done this, please describe in more detail what is happening and give some examples of file names that don't change.

HTH, Marco
