
How to remove duplicate files without wasting time

Marco Fioretti provides some code snippets to streamline the search and removal of duplicate files on your computer.

Duplicate files can end up on your computer in many ways. No matter how it happened, they should be removed as soon as possible. Waste is waste: why should you tolerate it? It's not just a matter of principle: duplicates make your backups, not to mention indexing with Nepomuk or similar engines, take more time than is really necessary. So let's get rid of them.

First, let's find which files are duplicates

Whenever I want to find and remove duplicate files automatically, I run two scripts in sequence. The first is the one that actually finds which files are copies of each other. For this task I use this small gem by J. Elonen, pasted here for your convenience:

  #! /bin/bash
  OUTF=rem-duplicates.sh;
  echo "#! /bin/sh" > $OUTF;
  echo ""                >> $OUTF;
  find "$@" -type f -print0 | xargs -0 -n1 md5sum | sort --key=1,32 | uniq -w 32 -d --all-repeated=separate | sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
  chmod a+x $OUTF

In this script, which I call find_dupes.sh, all the real black magic happens in the long find line. The original page explains all the details, but here is, in short, what happens: first, find lists all the files in the folders passed as arguments to the script, and xargs feeds them to md5sum, which calculates an MD5 checksum for each of them. Next, sort and uniq extract all the entries that share a checksum (and are, therefore, copies of the same file), and sed turns each of them into a commented-out shell command that would remove that file. Several options inside the script, explained on the original page, make sure that things will work even if you have file names with spaces or non-ASCII characters.
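For readability, here is the same pipeline again, spread out one stage per line and annotated; the redirection into $OUTF is left out and nothing else changes:

  find "$@" -type f -print0 |                 # list every regular file in the folders passed as arguments, NUL-separated
    xargs -0 -n1 md5sum |                     # compute one MD5 checksum per file
    sort --key=1,32 |                         # bring identical checksums onto adjacent lines
    uniq -w 32 -d --all-repeated=separate |   # keep only repeated checksums, one blank-line-separated group each
    sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/'   # drop the checksum, escape odd characters, prepend '#rm '

The result is something like this (from a test run made on purpose for this article):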

  [marco@polaris ~]$ find_dupes.sh /home/master_backups/rule /tmp/rule/
  [marco@polaris ~]$ more rem-duplicates.sh
  #! /bin/sh
  #rm /home/master_backups/rule/rule_new/old/RULE/public_html/en/test/makefile.pl
  #rm /tmp/rule/bis/rule_new/old/RULE/public_html/en/test/makefile.pl
  #rm /tmp/rule/rule_new/old/RULE/public_html/en/test/makefile.pl
  #rm /tmp/rule/zzz/rule_new/old/RULE/public_html/en/test/makefile.pl
  #all other duplicates...

As you can see, the script does find the duplicates (in the sample listing above, there are four copies of makefile.pl in four different folders), but it lets you decide which one to keep and which ones to remove, that is, which lines you should manually uncomment before executing rem-duplicates.sh. This manual editing can consume so much time that you'll feel like throwing the computer out of the window and going fishing.

Luckily, at least in my experience, this is almost never necessary. In practically all the cases in which I have needed to find and remove duplicates so far, there always was:

  • one original folder (/home/master_backups/ in this example) whose content should remain untouched;
  • all the unnecessary copies, scattered over many other, more or less temporary folders and subfolders (which, in our exercise, are all inside /tmp/rule/).

If that's the case, there's no problem in massaging the output of the first script to generate another one that leaves the first copy, the one in the master folder, alone and removes all the others. There are many ways to do this. Years ago, I put together these few lines of Perl to do it and they have served me well, but you're welcome to suggest your preferred alternative in the comments:

  1  #! /usr/bin/perl
  2
  3  use strict;
  4  undef $/;
  5  my $ALL = <>;
  6  my @BLOCKS = split (/\n\n/, $ALL);
  7
  8  foreach my $BLOCKS (@BLOCKS) {
  9      my @I_FILE = split (/\n/, $BLOCKS);
  10     my $I;
  11     for ($I = 1; $I <= $#I_FILE; $I++) {
  12         substr($I_FILE[$I], 0, 1) = '    ';
  13     }
  14     print join("\n", @I_FILE), "\n\n";
  15 }

This code puts all the text received from standard input inside $ALL, then splits it into @BLOCKS, using two consecutive newlines as the block separator (line 6). Each block is then split into an array of single lines (@I_FILE, line 9). Next, the first character of all but the first element of that array (which, if you've been paying attention, is the shell comment character, '#') is replaced by four white spaces. One would be enough, but code indentation is nice, isn't it?
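If you would rather stay in the shell, a rough shell-only sketch of the same idea could look like the following. It is only a sketch: it assumes the exact format produced by find_dupes.sh (a '#! /bin/sh' header followed by blank-line-separated groups of '#rm' lines) and simply keeps the first line of each group commented while uncommenting the rest:

  #! /bin/bash
  # shell-only sketch of dup_selector: reads rem-duplicates.sh on stdin,
  # writes the edited version on stdout
  first=yes
  while IFS= read -r line; do
      if [ -z "$line" ]; then
          first=yes                  # a blank line starts a new group
          echo ""
      elif [ "$first" = yes ]; then
          echo "$line"               # first line of the group: keep the leading '#'
          first=no
      else
          echo "    ${line#\#}"      # every other line: drop the '#' and indent
      fi
  done

You would run it just like the Perl version, for example as shell_dup_selector.sh < rem-duplicates.sh > remove_copies.sh (the file name is, of course, just a placeholder).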

When you run the Perl script (I call it dup_selector.pl) on the output of the first one, here's what you get:

  [marco@polaris ~]$ ./dup_selector.pl rem-duplicates.sh > remove_copies.sh
  [marco@polaris ~]$ more remove_copies.sh
  #! /bin/sh
  #rm /home/master_backups/rule/rule_new/old/RULE/public_html/en/test/makefile.pl
       rm /tmp/rule/bis/rule_new/old/RULE/public_html/en/test/makefile.pl
       rm /tmp/rule/rule_new/old/RULE/public_html/en/test/makefile.pl
       rm /tmp/rule/zzz/rule_new/old/RULE/public_html/en/test/makefile.pl
  ....

Which is exactly what we wanted, right? If the master folder doesn't have a name that makes its copy the first element of each block, you can temporarily rename it to something that sorts first, like /home/0.
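If you do need that trick, the whole sequence would look something like this (/home/0 here is just the example name mentioned above):

  mv /home/master_backups /home/0                          # make the master folder sort first within each block
  find_dupes.sh /home/0/rule /tmp/rule/                    # regenerate rem-duplicates.sh with the new paths
  ./dup_selector.pl rem-duplicates.sh > remove_copies.sh   # keep the first copy in each block, uncomment the rest
  sh remove_copies.sh                                      # actually remove the duplicates
  mv /home/0 /home/master_backups                          # give the master folder its name back

What's left? Oh, yes, cleaning up! After you've executed remove_copies.sh, /tmp/rule will contain plenty of empty directories that you'll want to remove before going there with your file manager to see what's left; there's no point wasting time looking inside empty boxes.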

How to find and remove empty directories

Several websites suggest some variant of this command to find and remove all the empty subdirectories:

find -depth -type d -empty -exec rmdir {} \;

This goes down the folder hierarchy depth-first (-depth, so that subdirectories are handled before their parents), finds all the objects that are directories AND are empty (-type d -empty), and executes the rmdir command on each of them. It works, and since -exec passes every name to rmdir as a single argument, even spaces and other weird characters are not a problem; variants that pipe the names to xargs or a shell loop without proper quoting, however, would choke on them. I tend to use a slightly more complicated command for this purpose anyway, one that first writes all the rmdir calls to a file I can look at before running it:

  [marco@polaris ~]$ find .  -depth -type d -empty | while read line ; do  echo -n "rmdir '$line" ; echo "'"; done > rmdirs.sh
  [marco@polaris ~]$ cat rmdirs.sh
  rmdir 'rule/slinky_linux_v0.3.97b-vumbox/images'
  rmdir 'rule/slinky_linux_v0.3.97b-vumbox/RedHat/RPMS'
  ...
  [marco@polaris ~]$ source rmdirs.sh

Using the while loop creates a command file (rmdirs.sh) that wraps each directory name in single quotes, so that rmdir always receives it as one single argument. This always works... with the obvious exception of names that contain single quotes! Dealing with those requires some shell quoting tricks that... we'll cover in another post!
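In the meantime, if you can live without an rmdirs.sh file to review before anything is deleted, the null-delimited variant below (which a reader also suggests in the comments) sidesteps the quoting problem entirely, because the directory names never go through the shell again:

  find . -depth -type d -empty -print0 | xargs -0 rmdir

For now, you know that whenever you have duplicate files to remove quickly, you can do it by running the two scripts shown here in sequence. Have fun!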

About

Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.

12 comments
yohimanshu

Hello Marco 

Thanks for the wonderful script. In my case, just comparing file sizes (the number of bytes in two files) would do. Your script does it much better.
Thanks again

AlexBanes

Thank you


You can also check out the DuplicateFilesDeleter program.

kwutchak

Hi, thanks for a very useful article. In your last step, I wonder if there is a reason why you didn't use find -print0 and xargs to handle the tricky directory names? Something like: find . -depth -type d -empty -print0 | xargs -0 -n1 rmdir

Barbelala

I am looking for a duplicate finder tool that can compare two files by CRC and remove the largest files. Can anyone here give me some suggestions?? I urgently need that kind of duplicate searching tool. Thanks

jawadsatti11

Thanks for the great idea. I also face this issue many times, so I think you have solved it.

Ronaldvr

Fslint works just as easily within the graphical shell, and is included in nearly every distribution.

lamp19

Finding duplicates and removing them with simple scripts has always been easy. But in practice, we may keep the same file in multiple locations for valid purposes. So, instead of just 'rm'ing the dups, I have always 'soft-linked' them to the original copy. I save huge space by avoiding dups, but still won't potentially break anything. hth.

nivas0522

It would be very useful if all duplicate files were sorted by file size, so that I could delete only those files which occupy more disk space. Also, how about ignoring empty files? Some servers use file locks or /tmp/status_ok for specific requirements; deleting these empty files affects functionality.

olibre

@Barbelala Remove all the unwanted, duplicated files from your machine. Software name is DuplicateFilesDeleter.


Barbelala

Symlink dups? How to do that? Can you specify it?? Thanks.

mfioretti

lamp19, I see your point. However, in my experience, the "same file in multiple locations for valid purposes" thing has happened to me many times, but only and always in specific, special directories (for example those where I compiled software). I handle those directories in other ways, including revision control systems. The scripts I explain here, instead, are specifically designed only for all those times and folders (e.g. archives of my articles) in which duplicates are completely useless, and the sooner they disappear the better; and I only use them in such folders. So I agree with you, it's just that in my own experience the "duplicates that have a purpose" and the "duplicates that have no purpose" never end up in the same folders. Marco

mfioretti

Nivas0522, in my mind, having ten copies of the same document is always absolutely bad regardless of its size, because it slows down backups, file searches and other operations. But if I need to keep a file, I need it regardless of its size. In other words, to me duplicates are a problem to solve in itself, regardless of size; recovering disk space is another issue, one that comes later, and that's why I have never considered adding a sorting function like the one you suggest. Besides, the reason I ignore empty files is that I only run these scripts on the folders that contain the documents I create or recover from backups, not on folders like /tmp that the system uses to work.
