Data Management

Removing Duplicate Records using SQL Server 2005

Removing duplicate records from tables is sometimes an arduous task, especially if the source table does not contain a primary key field. Here's how you can take advantage of a new SQL Server 2005 feature to remove duplicates quickly and efficiently.

Duplicate Records

Duplicate records can occur numerous ways, such as loading source files too many times, keying the same data more than once, or from just bad database coding. Having a primary key on your table (and you always should have one) can will in the removal of the duplicate records, but even w/ a primary key it is never a fun task to have handed to you to complete.

In today's example I will demonstrate how you can use a common-table expression (CTE) in 2005 to easily remove duplicate entries from a table, even when the table does not contain a primary key field. The script below creates the SalesHistory table, and loads 10 records into the table. Two of these records are duplicate entries, which I am defining as having the same values for the Product, SaleDate, and SalePrice fields. In the real world, these may not be duplicate records at all, but for our example we'll assume that they are.
DROP TABLE SalesHistory
CREATE TABLE [dbo].[SalesHistory]
        [Product] [varchar](10) NULL,
        [SaleDate] [datetime] NULL,
        [SalePrice] [money] NULL
INSERT INTO SalesHistory(Product, SaleDate, SalePrice)
SELECT 'Computer','1919-03-18 00:00:00.000',1008.00
SELECT 'BigScreen','1927-03-18 00:00:00.000',91.00
SELECT 'PoolTable','1927-04-01 00:00:00.000',139.00
SELECT 'Computer','1919-03-18 00:00:00.000',1008.00
SELECT 'BigScreen','1927-03-25 00:00:00.000',92.00
SELECT 'PoolTable','1927-03-25 00:00:00.000',108.00
SELECT 'Computer','1919-04-01 00:00:00.000',150.00
SELECT 'BigScreen','1927-04-01 00:00:00.000',	123.00
SELECT 'PoolTable','1927-04-01 00:00:00.000',	139.00
SELECT 'Computer','1919-04-08 00:00:00.000',	168.00

Now that I have some duplicate records loaded up, I can start writing my routine to remove them. The trick with removing duplicates is to find a way to delete them from your source table, rather than grouping all of the records in the source table together and creating a new table w/ the new entries. Deleting from the source usually requires a way to uniquely identify one of the unique records and then remove it. With some crafty TSQL, this is a relatively easy task to do when a primary key defined on the table. Luckily, the new CTE feature in SQL Server 2005 makes it very easy to remove these duplicates, with or without a primary key.

The script below defines my CTE. I am using a windowing function named DENSE_RANK to group the records together based on the Product, SaleDate, and SalePrice fields, and assign them a sequential value randomly. This means that if I have two records with the exact same Product, SaleDate, and SalePrice values, the first record will be ranked as 1, the second as 2, and so on.

;WITH SalesCTE(Product, SaleDate, SalePrice, Ranking)
	Product, SaleDate, SalePrice,
Ranking = DENSE_RANK() OVER(PARTITION BY Product, SaleDate, SalePrice ORDER BY NEWID() ASC)
FROM SalesHistory
WHERE Ranking > 1

Because a CTE acts as a virtual table, I am able to process data modification statements against it, and the underlying table will be affected. In this case, I am removing any record from the SalesCTE that is ranked higher than 1. This will remove all of my duplicate records.

To verify my duplicates have been removed, I can review the data in the table, which should now contain 8 records, rather than the previous 10.

FROM SalesHistory

Be Creative

Common-table expressions are a very useful new feature in SQL Server 2005. You can use them for recursive queries, removing duplicates, and even simple looping procedures. Explore this neat new feature to find some creative ways to solve some of your everyday database problems.

About Tim Chapman

Tim Chapman is a SQL Server MVP, a database architect, and an administrator who works as an independent consultant in Raleigh, NC, and has more than nine years of IT experience.

Editor's Picks

Free Newsletters, In your Inbox