Open Source

Rid your database of duplicate file entries with uniq

With the help of the Linux uniq command, getting rid of duplicate entries in your database is a snap! Let Jack Wallen, Jr., show you the way of uniq.


A colleague recently asked me how to get rid of duplicate database entries in Microsoft Access. Seeing as how I know next to nothing about Access, I decided to take a different approach to this problem by using open source tools.

In this Daily Feature, I will show you how I used a very small Linux tool, called uniq, to rid my colleague’s database (comprised of 55,000+ member names, addresses, phones, etc.) of unwanted duplicate entries.

The tool
The Linux uniq command is a very simple tool that removes duplicate entries from a text file. The syntax of the tool is uniq OPTION INPUT_FILE OUTPUT_FILE. There are a few switches that can be used with uniq. Of these switches, the most useful are:
  • ·        -c
    Prefix lines by number of occurrences
  • ·        -d
    Only print duplicate lines
  • ·        -f
    Avoid comparing the first N fields (where N is a user-defined positive integer)
  • ·        -i
    Ignore case
  • ·        -s
    Avoid comparing the first N characters (where N is a user-defined positive integer)
  • ·        -u
    Only print unique lines
  • ·        -w
    Compare no more than N character in each line (where N is a user-defined positive integer)

The setup
Since the uniq tool cannot act upon a binary file, I had to first export the Access database file into a comma-delineated list. With the database in this form, the uniq tool was able to function properly. Once the file was ftp'd from the Windows machine to a Linux server, the task could be completed.

To make this task a bit easier to complete, I ran the Linux sort command on the list. By doing this, it was much more readily apparent whether or not the unique command succeeded.

The usage
I stored the comma-delineated list, named database_duplicates.txt, in /usr/local. Then I sorted the list with the command sort database_duplicates > database_duplicates_sort.txt,which dumps the output of the sort command into a new file called database_duplicates_sort.txt. It was now time to run the command. After opening up a console window, I ran the command uniq database_duplicates_sort.txt database_duplicates_sort_uniq.txt.

That’s all it takes—almost.

Coloring within the lines
If I opened up the file at this point, I would still see some duplicate entries. Why? Looking closely at the duplicate entries you can see why they weren’t deleted:
Wallen, Jack, One TechRepublic Way
Wallen, Jack, 1 TechRepublic Way

Although I know these are duplicate entries, uniq sees the difference between 1and One and sees the above as two completely unique lines.

The uniq tool has a very effective way of solving this problem. Since I know that all the users in this list have only one entry (even though it may be listed in differing ways as shown above), and I know every entry begins with the users last name followed by the first name, I can tell uniq to only search the line to a certain character limit. Assuming that searching up to character 20 is a safe enough distance to determine a match, I run theuniq command with the -w switch as shown in Listing A.

I can check to see if the process has worked by running the uniq -d command, which will print out all remaining duplicate entries. Upon running the command uniq -d database_duplicates_sort_uniq.txt, no duplicate entries are reported. This means the process has been a success. The database can now be imported back into Access.

No limits
The uniq tool is not just limited to comma-delineated database lists. In fact, I have used this tool to search large source code to find duplicate line entries, to search larger table of contents, to detect how many entries were in a specific file, and numerous other applications.

Like many Linux applications, uniq can be used in conjunction with other commands. For example, it can be called from shell scripts or it can be set on a cron job to regularly strip files of duplicate entries. The uniq command, like much of using Linux, seems only to be limited in terms of your imagination.

About

Jack Wallen is an award-winning writer for TechRepublic and Linux.com. He’s an avid promoter of open source and the voice of The Android Expert. For more news about Jack Wallen, visit his website jackwallen.com.

Editor's Picks