How to compare the content of two or more directories automatically

Marco Fioretti suggests some ways in Linux to automatically compare the contents of multiple directories in order to find missing, duplicate, or unwanted files.

Many of us inevitably end up with so many files and folders that it is impossible to keep them under control without some specialized help. Luckily, as I'll show you in a moment, Linux offers several very efficient solutions to this problem.

Multiple copies of many files, scattered all over the computer, waste space, create confusion, and slow down desktop indexers like DocFetcher. I have already explained how to find and remove the unwanted extra copies in an earlier post.

When it comes time to clean up your folders and files, a common problem crops up: how can I find where duplicate files and folders exist between multiple directories?  The problem is both more complex and much more common than it may appear at first sight. A directory may contain many, many levels of sub-directories, each with thousands of files of all sorts. Trying to figure out manually the differences between two directory trees like those could take days.

One reason why you need to know the differences between directories is so you can ensure that all your backups are working as expected! What if the automated backup procedure you run every day has a bug? What if a sector of the external drive(s), DVDs, or remote computer to which you continuously copy all your precious folders suddenly (and silently) broke? Would you notice it before actually needing those backups? This is the main reason to be able to quickly find out if the contents of two folders differ. Let's see how to make this easy.

Automatic comparison

It is important to be able to run certain checks automatically from a shell script, especially if all you want is a quick yes or no answer and automatic notifications. Here are a few command line utilities that you may use as a basis for scripts that perform such checks. You may then run those scripts either as automatic cron jobs, or whenever you feel like checking if that DVD or external drive is still free from errors.

find

This pipe of commands:

find "$FOLDER" -type f | cut -d/ -f2- | sort > "/tmp/file_list_$FOLDER"

will save in /tmp/file_list_$FOLDER an alphabetically ordered list of all the files inside $FOLDER, complete with the corresponding sub-folders (this assumes that $FOLDER is a plain directory name inside the current directory, with no slashes in it), e.g. something like this:

  family/health_insurance.pdf
  family/holiday_quote.pdf
  pictures/2012/graduation.jpg
  work/linux-review.odt

Running the pipe on more directories and comparing the corresponding file lists will not find all the differences between them. You will only spot missing files, or folders containing sets of files with different names. Files with the same names and in the same subfolders, but with different content, will not show in the lists. Still, this may be a very quick way to spot certain mismatches.
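
To make the comparison of two such lists concrete, here is a minimal sketch. The function names list_files and compare_file_lists are made up for this example, and the code assumes file names that contain no newlines. It builds both lists relative to each folder, then uses comm (which expects both inputs sorted in the same order) to print only the names found on one side.

```shell
# List every regular file under directory $1, as sorted paths relative to it.
list_files() {
    # cd inside a subshell, so the caller's working directory is untouched
    ( cd "$1" && find . -type f | cut -d/ -f2- | LC_ALL=C sort )
}

# Print the names that exist in only one of the two trees:
#   "< path" = only in the first tree, "> path" = only in the second one.
compare_file_lists() {
    list_a=$(mktemp) || return 1
    list_b=$(mktemp) || return 1
    list_files "$1" > "$list_a"
    list_files "$2" > "$list_b"
    # comm -23 keeps lines unique to the first file, comm -13 to the second
    LC_ALL=C comm -23 "$list_a" "$list_b" | sed 's/^/< /'
    LC_ALL=C comm -13 "$list_a" "$list_b" | sed 's/^/> /'
    rm -f "$list_a" "$list_b"
}
```

Calling compare_file_lists todo_orig todo_backup would print one line per file that is missing on either side, and nothing at all if the two lists match. Like the plain lists, it says nothing about files with the same name but different content.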

diff

Diff is normally used to compare two files, but can do much more than that. The options "r" and "q" make it work recursively and quietly, that is, it only reports which files differ without printing the differences themselves, which is just what we are looking for:

  marco #> diff -rq todo_orig/ todo_backup/
  Only in todo_orig/essays: Digital-Citizenship-tech4engage-summit-report.pdf
  Files todo_orig/copyright/copyright_licensing.t2t and todo_backup/copyright/copyright_licensing.t2t differ
  diff: todo_orig/embedded_linux/init.d/led_driver: No such file or directory
  diff: todo_backup/embedded_linux/init.d/led_driver: No such file or directory
  Files todo_orig/strider/food/backpacking_food.t2t and todo_backup/strider/food/backpacking_food.t2t differ
  ...

As you can see, all the differences between two directory trees appear, be they files only present in one of them, or files that are different. Even files like "led_driver" are listed: they are present in both folders but don't really exist, because they are links to other files that were deleted. Counting the number of lines generated by such an invocation of diff shows immediately if the two trees differ, as in this Bash sketch:

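  # 2>&1 also counts error messages, e.g. those about broken links
  DIFF_NUM=$(diff -rq "$DIR_1" "$DIR_2" 2>&1 | wc -l)
  if [ "$DIFF_NUM" -gt 0 ]
     then
     # send me an email listing all the differences
     :   # placeholder: put the actual notification command here
     fi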
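
Counting lines is not the only option. GNU diff also reports the result through its exit status: 0 when it found no differences, 1 when it found some, and 2 when it ran into trouble. The sketch below relies on that. The name check_trees is a hypothetical helper, and printing the report stands in for whatever notification (email or other) you prefer.

```shell
# Compare two directory trees and return diff's own exit status:
#   0 = identical, 1 = differences found, 2 = diff could not do its job.
check_trees() {
    # Capture the report, errors included, so it can be sent along later
    report=$(diff -rq "$1" "$2" 2>&1)
    status=$?
    case $status in
        0) echo "OK: $1 and $2 match" ;;
        1) echo "DIFFERENT: $1 and $2"
           printf '%s\n' "$report" ;;
        *) echo "ERROR while comparing $1 and $2"
           printf '%s\n' "$report" ;;
    esac
    return $status
}
```

In a cron job, something like check_trees "$DIR_1" "$DIR_2" || send_report (where send_report is your own notification command) would then only bother you when something is wrong.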

rsync

Rsync can produce a difference report that you may parse and use in the same way as the one from diff:

  marco #> rsync -rvnc --delete todo_sync/ todo_orig/
  sending incremental file list
  deleting essays/Digital-Citizenship-tech4engage-summit-report.pdf
  copyright/copyright_licensing.t2t
  skipping non-regular file "embedded_linux/init.d/led_driver"
  strider/food/backpacking_food.t2t
  sent 148763 bytes  received 473 bytes  27133.82 bytes/sec
  total size is 854518613  speedup is 5725.95 (DRY RUN)

The four command line switches r, v, n and c tell rsync (check the man page for details) to perform a recursive, verbose, checksum-based synchronization of the two directories, but only for show: -n, in fact, makes rsync display what it would do if you let it make the second folder a perfect copy of the first one, without changing anything. The --delete option adds to the report the files that exist only in the second folder. The huge advantage of rsync over diff is that the former can compare local directories with remote ones.
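
For use in a script, a report with one line per change is easier to parse than the verbose output. The following is only a sketch under a few assumptions: rsync_report is a made-up function name, and it swaps -v for rsync's --itemize-changes option, which prefixes every entry with a short code describing the change. Files that exist only in the second folder appear on lines containing the word "deleting".

```shell
# Print one line per change rsync WOULD make to turn $2 into a copy of $1.
# Thanks to -n (dry run) nothing is actually copied or deleted.
rsync_report() {
    # ${1%/}/ forces exactly one trailing slash: compare folder contents,
    # not the folders themselves
    rsync -rnc --delete --itemize-changes "${1%/}/" "${2%/}/"
}
```

The second argument may also be a remote folder, for example rsync_report todo_orig backupuser@backuphost:todo_backup, where the user and host names are of course placeholders.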

Interactive comparison

The command line is great, but not the best solution for all jobs. The programs I already described are perfect to find differences without wasting time. Fixing those differences, however, may be more productive with a graphical, interactive interface like meld. This great little Python tool, available as a binary package for several GNU/Linux distributions, can do several things. One is to compare up to three folders simultaneously. After you select them as in Figure A, meld will display (Figure B) which files or folders are missing, or are different, in each of them.

Figure A

Figure B

You will also be able to define your own comparison filters, to make your checks faster: Figure C shows the creation, in the "Edit->Preferences->File Filters" tab of meld, of a "photographs" filter that will only look at files with the .jpg or .jpeg extension.

Figure C

About

Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.
