Data Management

Removing Duplicate Records using SQL Server 2005

Removing duplicate records from tables is sometimes an arduous task, especially if the source table does not contain a primary key field. Here's how you can take advantage of a new SQL Server 2005 feature to remove duplicates quickly and efficiently.

Duplicate Records

Duplicate records can occur numerous ways, such as loading source files too many times, keying the same data more than once, or from just bad database coding. Having a primary key on your table (and you always should have one) can will in the removal of the duplicate records, but even w/ a primary key it is never a fun task to have handed to you to complete.

In today's example I will demonstrate how you can use a common-table expression (CTE) in 2005 to easily remove duplicate entries from a table, even when the table does not contain a primary key field. The script below creates the SalesHistory table, and loads 10 records into the table. Two of these records are duplicate entries, which I am defining as having the same values for the Product, SaleDate, and SalePrice fields. In the real world, these may not be duplicate records at all, but for our example we'll assume that they are.
IF OBJECT_ID('SalesHistory') IS NOT NULL
DROP TABLE SalesHistory
CREATE TABLE [dbo].[SalesHistory]
  (
        [Product] [varchar](10) NULL,
        [SaleDate] [datetime] NULL,
        [SalePrice] [money] NULL
  )
  GO
INSERT INTO SalesHistory(Product, SaleDate, SalePrice)
SELECT 'Computer','1919-03-18 00:00:00.000',1008.00
UNION ALL
SELECT 'BigScreen','1927-03-18 00:00:00.000',91.00
UNION ALL
SELECT 'PoolTable','1927-04-01 00:00:00.000',139.00
UNION ALL
SELECT 'Computer','1919-03-18 00:00:00.000',1008.00
UNION ALL
SELECT 'BigScreen','1927-03-25 00:00:00.000',92.00
UNION ALL
SELECT 'PoolTable','1927-03-25 00:00:00.000',108.00
UNION ALL
SELECT 'Computer','1919-04-01 00:00:00.000',150.00
UNION ALL
SELECT 'BigScreen','1927-04-01 00:00:00.000',	123.00
UNION ALL
SELECT 'PoolTable','1927-04-01 00:00:00.000',	139.00
UNION ALL
SELECT 'Computer','1919-04-08 00:00:00.000',	168.00

Now that I have some duplicate records loaded up, I can start writing my routine to remove them. The trick with removing duplicates is to find a way to delete them from your source table, rather than grouping all of the records in the source table together and creating a new table w/ the new entries. Deleting from the source usually requires a way to uniquely identify one of the unique records and then remove it. With some crafty TSQL, this is a relatively easy task to do when a primary key defined on the table. Luckily, the new CTE feature in SQL Server 2005 makes it very easy to remove these duplicates, with or without a primary key.

The script below defines my CTE. I am using a windowing function named DENSE_RANK to group the records together based on the Product, SaleDate, and SalePrice fields, and assign them a sequential value randomly. This means that if I have two records with the exact same Product, SaleDate, and SalePrice values, the first record will be ranked as 1, the second as 2, and so on.

;WITH SalesCTE(Product, SaleDate, SalePrice, Ranking)
AS
(
SELECT
	Product, SaleDate, SalePrice,
Ranking = DENSE_RANK() OVER(PARTITION BY Product, SaleDate, SalePrice ORDER BY NEWID() ASC)
FROM SalesHistory
)
DELETE FROM SalesCTE
WHERE Ranking > 1

Because a CTE acts as a virtual table, I am able to process data modification statements against it, and the underlying table will be affected. In this case, I am removing any record from the SalesCTE that is ranked higher than 1. This will remove all of my duplicate records.

To verify my duplicates have been removed, I can review the data in the table, which should now contain 8 records, rather than the previous 10.

SELECT *
FROM SalesHistory

Be Creative

Common-table expressions are a very useful new feature in SQL Server 2005. You can use them for recursive queries, removing duplicates, and even simple looping procedures. Explore this neat new feature to find some creative ways to solve some of your everyday database problems.

About

Tim Chapman is a SQL Server MVP, a database architect, and an administrator who works as an independent consultant in Raleigh, NC, and has more than nine years of IT experience.

7 comments
srice1969
srice1969

Brillinat - works a treat saved me many hours of work. Thanks

varshney4u
varshney4u

Great Article :) Thanx it help me a lot !!!

godfreyi
godfreyi

Thanks Tim, extremely helpful. Ian

ratheeshpatakal
ratheeshpatakal

Hello guys i want to enable sa userid from another normal user id please help me it is urgent.

Tony Hopkinson
Tony Hopkinson

the ability to impersonate another user is an sa privilege. If you could, it would be an appallingly bad idea, which is why you can't. What do you really want to do?