How to remove weird characters from file and directory names automatically

Marco Fioretti wrote a simple script to solve a pesky problem automatically. He shares his method of cleaning up bad folder and file names.

Personally, I have always tried to avoid files with non-alphanumeric names, because coping with them makes lots of my scripts harder to write than they could be.

Weird file and folder names cause me a lot of trouble whenever I support my "computer-challenged" friends who run Linux. They like Free Software, but they name their folders and files in the craziest and most wildly inconsistent ways. Besides making command-line tasks unnecessarily difficult, this habit sometimes causes annoying and hard-to-diagnose problems. I've seen OpenOffice on Ubuntu fail to open files that had been manually saved as "myfile.doc " (note the trailing space), and people get angry because "it won't open that darn .doc file!!!"

It was to cope with such situations that I wrote the script I present this week. It automatically sanitizes the names of all the sub-folders and files in a given directory, even if they have names such as:

Photos: Canadair CL-215-6B11 CL-415 Aircraft Pictures | Airliners/civilian aircraft, /

Holiday:Me & Mike's family! , Phoenix ,/ 30-10:Panorama.jpg

The code is fewer than twenty lines:

       1  #! /bin/bash
       3  rm -f /tmp/clean_dir_file_names*     # remove leftovers from earlier runs
       4  cd "$1" || exit 1                    # stop if the directory is missing
       5  find .  | awk '{ print length(), $0 | "sort -n -r" }' | \
       6          grep -v '^1 \.$' | cut -d/ -f2- > /tmp/clean_dir_file_names_1
       8  touch /tmp/clean_dir_file_names_2    # exists even if nothing needs renaming
       9  while IFS= read -r line              # keep spaces and backslashes intact
      10    do
      11    BASE=$(basename "$line")
      12    NEWBASE=$(basename "$line" | perl -e '$N = <>; chomp ($N); $N =~ s/[^a-zA-Z0-9_.-]/_/g; $N =~ s/_+/_/g; $N= lc($N); $N =~ s/_([a-z])/"_".uc($1)/eg; print ucfirst($N);')
      13    if [ "$BASE" != "$NEWBASE" ]
      14    then
      15    OLDPATH=$(echo "$line" | sed -r 's/([^a-zA-Z0-9./_-])/\\\1/g')
      16    DIR=$(dirname "$line" | sed -r 's/([^a-zA-Z0-9./_-])/\\\1/g')
      17    echo "mv -i $OLDPATH $DIR/$NEWBASE" >> /tmp/clean_dir_file_names_2
      18    fi
      19  done </tmp/clean_dir_file_names_1
      20  exit

The basic processing flow is simple: find all the files and folders in the starting directory and give each of them, one at a time, a new name without any weird characters. However, files and folders must be processed in the right order. Consider a file with a weird name, inside several levels of weirdly named sub-folders:

pictures/ holidays!/2008, Dec. 1st: cruise/Hey, what's up?.jpg

You'll realize that the script must work bottom-up: for each branch of the directory tree, first rename the files and folders at the deepest level, then move up one level and repeat. In the case shown above, this means doing something like:

mv pictures/ holidays!/2008, Dec. 1st: cruise/Hey, what's up?.jpg pictures/ holidays!/2008, Dec. 1st: cruise/Hey_what_s_up.jpg

mv pictures/ holidays!/2008, Dec. 1st: cruise pictures/ holidays!/2008_Dec_1st_cruise

mv pictures/ holidays! pictures/holidays

Working in any other order would cause the script to fail, unless it kept track of the names it had already changed or scanned all the folders again after every operation. Luckily, the solution is really simple to implement: the script must sort all the file and folder names by... string length, from longest to shortest. The path of anything inside a folder is always longer than the path of the folder itself, so this is enough to guarantee that the script renames a folder only after it has renamed all the files and sub-folders inside it. The sorting happens in line 5:

  • find lists all the files and folders
  • awk prints each name preceded by its length
  • sort -n -r sorts the resulting lines of text in reverse numeric order, longest first.

The grep and cut invocations in line 6 remove the line for the top directory itself (".", whose length is 1) and strip the length and the leading "./" from every other line. The result, saved in the temporary file /tmp/clean_dir_file_names_1, will have this format:

  test/Gallery, personal/Recipes/basmati and shrimps, rice.jpeg
  test/Gallery, personal/airport Canadair, take-off.jpeg
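
To see the ordering for yourself, here is a minimal sketch that wraps the pipeline of lines 5 and 6 in a function and runs it on a throwaway tree. The function name list_deepest_first and the sample folder names are made up for this illustration; only the pipeline comes from the script.

```shell
#!/bin/bash
# list_deepest_first DIR: print every path below DIR, longest path first.
# (Hypothetical helper name; the pipeline is the one from lines 5-6.)
list_deepest_first() {
    ( cd "$1" || exit 1
      find . | awk '{ print length(), $0 | "sort -n -r" }' |
          grep -v '^1 \.$' | cut -d/ -f2- )
}

# Build a small throwaway tree to try it on.
demo=$(mktemp -d)
mkdir -p "$demo/a b/c, d"
touch "$demo/a b/c, d/e!.txt"

list_deepest_first "$demo"
# prints:
#   a b/c, d/e!.txt
#   a b/c, d
#   a b

rm -rf "$demo"
```

The file comes first and the outermost folder last, which is exactly the order in which the renaming has to happen.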

The loop starting on line 9 reads this file one line at a time, generates all the renaming instructions and saves them into /tmp/clean_dir_file_names_2. It uses the basename and dirname commands, which do just what their names say. If you have a file called /some/folder/somewhere/myfile, basename will return "myfile" and dirname "/some/folder/somewhere".
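
A quick way to check this at the prompt, using the path above and one of the sample lines from the temporary file (the variable names are only for illustration):

```shell
path='/some/folder/somewhere/myfile'
basename "$path"   # prints: myfile
dirname "$path"    # prints: /some/folder/somewhere

# Quoting the argument matters as soon as the name contains spaces:
odd='test/Gallery, personal/airport Canadair, take-off.jpeg'
basename "$odd"    # prints: airport Canadair, take-off.jpeg
dirname "$odd"     # prints: test/Gallery, personal
```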

Line 12, which calculates the new name of the current file or folder with a Perl one-liner, is the only part of this script that you should change to suit your taste. My version performs, in sequence, these operations:

  • remove trailing newline (chomp)
  • replace all characters except letters, digits, dash, underscore and period with underscores
  • replace consecutive underscores with only one of them
  • change everything to lower case (lc($N))
  • use the uc and ucfirst Perl functions to capitalize all the words in the file name: "this_is_ME_iN_1998.JPG" becomes "This_Is_Me_In_1998.jpg"

If the new base name is different from the original one (line 13), then we have to generate a renaming command. However, precisely because we're dealing with non-alphanumeric characters, we have to escape all of them. This is exactly what happens in lines 15 and 16: using extended regular expressions (-r), sed finds every character that is not a letter, digit, period, slash, underscore or dash and puts a backslash in front of it. Running the script, you'll get a file full of valid shell commands of this kind:

  mv -i test/Gallery\,\ personal/Recipes/basmati\ and\ shrimps\,\ rice.jpeg test/Gallery\,\ personal/Recipes/Basmati_And_Shrimps_Rice.jpeg
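
The escaping step can be tried in isolation too. In this sketch the function name shell_escape is hypothetical, while the sed expression is the one used in lines 15 and 16:

```shell
# shell_escape PATH: hypothetical helper name; same sed expression as lines 15-16.
shell_escape() {
    printf '%s\n' "$1" | sed -r 's/([^a-zA-Z0-9./_-])/\\\1/g'
}

shell_escape 'test/Gallery, personal/Recipes/basmati and shrimps, rice.jpeg'
# prints: test/Gallery\,\ personal/Recipes/basmati\ and\ shrimps\,\ rice.jpeg
```

Slashes are left alone on purpose, since they separate the folders in the path.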

Typing source /tmp/clean_dir_file_names_2 at the prompt, from inside the directory you passed to the script (the generated paths are relative to it), will rename everything at once. Yes, of course you could put that command in the script itself, but it's better to have a look at which commands were generated before executing them, wouldn't you say?
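
Since the generated paths are relative, the commands only work from inside the directory that was cleaned. A small helper can take care of that without changing your current directory. This is a sketch, not part of the original script: the name apply_renames and the ~/Pictures folder in the usage note are assumptions.

```shell
# apply_renames DIR CMDFILE: hypothetical helper, not part of the original script.
# CMDFILE must be an absolute path, because the subshell changes directory first.
apply_renames() {
    # The parentheses start a subshell, so the caller stays where it was.
    ( cd "$1" && . "$2" )
}

# Typical use, after reading the generated file first:
#   less /tmp/clean_dir_file_names_2
#   apply_renames ~/Pictures /tmp/clean_dir_file_names_2
```

Because every command is mv -i, you will still be asked for confirmation if two different files end up with the same clean name.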

The main, if not only, limitation I find in this script is that it replaces any letter not in the English alphabet. When dealing with names made of Italian words, this means "losing" the accented vowels, but it is such a rare occurrence that none of my users has complained. With other languages this may be a bigger issue (if one wanted to preserve those characters, that is), but I have never needed to deal with it. If you have, please let me know how!


Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.
