Duplicate records in a database are bad. You can’t tolerate them. At the very least, they’ll produce misleading analysis; at worst, they’ll wreck everything. Your application will probably run, but everything it generates will be suspect. You have to find them and delete them, or your client might just delete you!
Of course, the applications you develop prevent duplicates from the get-go — pat yourself on the back. However, your keen insight into the problems duplicates pose won’t help you if you inherit a legacy application or have to import dirty data. The good news is that finding and deleting duplicate records in Microsoft SQL Server isn’t a difficult task. However, determining just what constitutes a duplicate can be a bit tricky.
This blog post is also available in the PDF format in a TechRepublic Download.
The definition of duplicate
Before you start deleting records, you need to decide what the term duplicate means. Most of us would define a duplicate as any record whose data matches another record column for column (including or excluding the primary key). The discussion becomes more complex as you narrow the focus from all columns to just a few.
In most cases, business rules, not the mere repetition of values, determine which repeated data constitutes a duplicate. No definition is absolute, and knowing your data is key. That means there’s no silver bullet; the solution in Listing A is generic, and you’ll have to adjust it to fit your situation.
SELECT * | DELETE FROM table
WHERE EXISTS
  (SELECT NULL
   FROM table AS tmptable
   WHERE table.field1 = tmptable.field1
   [AND table.field2 = tmptable.field2]
   HAVING table.uniquefield > MIN(tmptable.uniquefield))
Table A lists the variables that you must update.
Table A: Update these variables
|table|The table from which you’re deleting duplicate records.|
|uniquefield|Any column that uniquely identifies each record. This can be a primary key or identity column.|
|tmptable|An alias for the table created by a subquery.|
|field1, field2|The column(s) that comprise the duplicate.|
The SELECT NULL subquery matches each record in the table against an aliased copy of the same table on the duplicate columns. The HAVING clause then filters those matches so that only the rows whose uniquefield value is greater than the minimum value among the matching rows are returned.
You can think of the record with the minimum uniquefield value as the original, but it doesn’t really matter; the code simply uses it as an anchor, a fixed point of comparison regardless of the input order of the matching records. The subquery identifies the duplicates of each original record. Run the DELETE statement to remove those records from table; afterward, only the record with the minimum uniquefield value remains for each set of duplicates.
Run the SELECT statement first and examine its results before you actually delete any records. The AND clause is unnecessary when you’re comparing a single column; for each additional column you add to the duplicate comparison, add another AND clause, and update the column references in the WHERE clause accordingly.
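To make the substitutions in Table A concrete, here’s a small Python sketch that plugs hypothetical values (a ZipCodes table keyed by CodeID) into the generic statement and prints the result. It’s only for visualizing the substitution during ad hoc cleanup; string-built SQL is unsafe with untrusted input.

```python
# Hypothetical placeholder values, following Table A. This only assembles
# the statement text so the substitution is easy to see; never build SQL
# from strings that include untrusted input.
table = "ZipCodes"          # table holding the duplicates
uniquefield = "CodeID"      # column that uniquely identifies each record
dup_columns = ["ZIPCode"]   # column(s) that define a duplicate

# One comparison per duplicate column, joined with AND.
conditions = "\n   AND ".join(
    f"{table}.{col} = tmptable.{col}" for col in dup_columns
)
statement = f"""SELECT * FROM {table}
WHERE EXISTS
  (SELECT NULL
   FROM {table} AS tmptable
   WHERE {conditions}
   HAVING {table}.{uniquefield} > MIN(tmptable.{uniquefield}))"""
print(statement)
```

Adding a second entry to dup_columns, such as "City", produces the extra AND clause described above.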
A quick example
The table in Figure A contains a few duplicate records. Specifically, three records repeat the same ZIPCode value. Two records repeat the same ZIPCode and City value.
This table contains duplicate records.
Running the SELECT statement shown in Figure B against this table returns two records. One ZIPCode value occurs three times, but the statement returns only those records whose uniquefield value (CodeID, in this case) is greater than the minimum value among the matching records. Figure C shows what happens when an AND clause compares both the ZIPCode and City columns: there’s only one record where both values are duplicated.
The SELECT statement returns two duplicate records.
Expanding the comparison reduces the number of duplicate records.
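The figures can be reproduced end to end with a short script. The sketch below uses Python’s built-in sqlite3 module as a stand-in for SQL Server, with hypothetical data modeled on the example (note one syntax difference: SQLite requires a GROUP BY clause before HAVING, while SQL Server accepts the HAVING clause on its own).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ZipCodes (CodeID INTEGER PRIMARY KEY, ZIPCode TEXT, City TEXT);
    INSERT INTO ZipCodes VALUES
        (1, '40202', 'Louisville'),
        (2, '40202', 'Louisville'),    -- repeats ZIPCode and City of record 1
        (3, '40202', 'Jefferson'),     -- repeats ZIPCode only
        (4, '42101', 'Bowling Green');
""")

# SELECT first: records whose ZIPCode matches another record's and whose
# CodeID is greater than the minimum CodeID among the matches.
dupes_zip = conn.execute("""
    SELECT CodeID FROM ZipCodes
    WHERE EXISTS
        (SELECT NULL FROM ZipCodes AS tmptable
         WHERE ZipCodes.ZIPCode = tmptable.ZIPCode
         GROUP BY tmptable.ZIPCode
         HAVING ZipCodes.CodeID > MIN(tmptable.CodeID))
    ORDER BY CodeID
""").fetchall()
print(dupes_zip)        # two duplicates, as in Figure B

# Adding City to the comparison narrows the result, as in Figure C.
dupes_zip_city = conn.execute("""
    SELECT CodeID FROM ZipCodes
    WHERE EXISTS
        (SELECT NULL FROM ZipCodes AS tmptable
         WHERE ZipCodes.ZIPCode = tmptable.ZIPCode
           AND ZipCodes.City = tmptable.City
         GROUP BY tmptable.ZIPCode, tmptable.City
         HAVING ZipCodes.CodeID > MIN(tmptable.CodeID))
    ORDER BY CodeID
""").fetchall()
print(dupes_zip_city)   # one duplicate

# Swap SELECT for DELETE to remove the ZIPCode duplicates; only the
# record with the minimum CodeID per ZIPCode survives.
conn.execute("""
    DELETE FROM ZipCodes
    WHERE EXISTS
        (SELECT NULL FROM ZipCodes AS tmptable
         WHERE ZipCodes.ZIPCode = tmptable.ZIPCode
         GROUP BY tmptable.ZIPCode
         HAVING ZipCodes.CodeID > MIN(tmptable.CodeID))
""")
remaining = conn.execute("SELECT CodeID FROM ZipCodes ORDER BY CodeID").fetchall()
print(remaining)
```
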
While this approach is simple, you’ll want to consider a few things:
- This statement isn’t dynamic, which means you’ll have to update the references each time you use it. If you delete duplicates regularly on different tables, you might want to find a different approach. If you run it regularly against the same table, consider converting it to a stored procedure.
- If you’re uncomfortable deleting records, make a copy of table and hang on to it for a while. Or, don’t delete the records at all. Instead, move them to an archive table, or flag them as deleted in the original table. Either way allows you to filter the records from current processing and maintain a historical perspective.
- This solution works on a single table. If you’re deleting records based on values in a related table, you need to consider far more than just duplicate values.
- If uniquefield is an identity column, you don’t have to worry about missing values. The code works just fine. Nor do the records have to be in any specific order.
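The archive-table option follows the same pattern: the same EXISTS filter drives an INSERT INTO ... SELECT into the archive table and then the DELETE. A minimal sketch, again using sqlite3 as a stand-in with hypothetical table names; in production you’d wrap the two statements in a transaction so the copy and the delete succeed or fail together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ZipCodes (CodeID INTEGER PRIMARY KEY, ZIPCode TEXT);
    CREATE TABLE ZipCodes_Archive (CodeID INTEGER, ZIPCode TEXT);
    INSERT INTO ZipCodes VALUES (1, '40202'), (2, '40202'), (3, '42101');
""")

# The same duplicate filter feeds both statements.
dup_filter = """
    WHERE EXISTS
        (SELECT NULL FROM ZipCodes AS tmptable
         WHERE ZipCodes.ZIPCode = tmptable.ZIPCode
         GROUP BY tmptable.ZIPCode
         HAVING ZipCodes.CodeID > MIN(tmptable.CodeID))
"""
with conn:  # single transaction: copy the duplicates out, then delete them
    conn.execute("INSERT INTO ZipCodes_Archive SELECT * FROM ZipCodes" + dup_filter)
    conn.execute("DELETE FROM ZipCodes" + dup_filter)

remaining = conn.execute("SELECT CodeID FROM ZipCodes ORDER BY CodeID").fetchall()
archived = conn.execute("SELECT CodeID FROM ZipCodes_Archive ORDER BY CodeID").fetchall()
print(remaining, archived)   # the original keeps one record per ZIPCode
```
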
Reduce duplicates and your work
You’ll find this solution for finding duplicate records dependable. It’s easy to implement because it requires no temporary tables or cursors, and the code is simple enough to update quickly.
Susan Sales Harkins is an independent consultant and the author of several articles and books on database technologies. Her most recent book is “Mastering Microsoft SQL Server 2005 Express,” with Mike Gunderloy, published by Sybex. Other collaborations with Mike Gunderloy are “Automating Microsoft Access 2003 with VBA,” “Upgrader’s Guide to Microsoft Office System 2003,” “ICDL Exam Cram 2,” and “Absolute Beginner’s Guide to Microsoft Access 2003” all by Que. Currently, Susan volunteers as the Publications Director for Database Advisors at http://www.databaseadvisors.com. You can reach her at firstname.lastname@example.org.