Tables that don’t have a Primary Key constraint protecting
them can have duplicate rows inserted into them. The traditional way to locate
such duplicates is with a query using the GROUP BY and HAVING keywords. After
grouping the data on the key columns and counting the rows in each group, the
groups with more than one member are the ones with duplicates.
Although it’s easy to find such rows, cleaning up the
problem can be time consuming. In Oracle, the unique ROWID pseudocolumn
means that no two rows are really ever identical. You can always do a DELETE
query that references all of the ROWIDs except one to
eliminate the duplicates. That works fine—if you don’t have many duplicates to
remove. The analytic functions introduced in Oracle 9i give us an easier way to
do this cleanup.
The ROW_NUMBER() analytic function
is like the ROWNUM pseudocolumn in that it numbers
the output rows. However, ROWNUM is one unbroken sequence over the whole rowset, whereas ROW_NUMBER resets back to one for each
partition we define within the set. The result is an easy way to see not only
which groups have multiple members, but also which exact rows to eliminate.
The format of an analytical query is:
Functionname (arguments) OVER (PARTITION BY columns ORDER BY columns)
Let’s say that a mistake is made in creating a copy of the
SCOTT.EMP table and all the rows are entered twice. Attempting to add a Primary
Key constraint will fail because the data already has duplicates. Listing A
shows the process, divided into stages for clarity.
First, an analytic query is written that partitions by the empno column; it uses ROW_NUMBER()
to number the rows in each partition. If there are no duplicates, the partition
will have only one row, numbered “1”. But, if duplicates exist, they
will be given the numbers 2, 3, etc. This query also returns ROWIDs that we can use to uniquely identify rows. The first
query is then used as an inline view in another query that uses a WHERE clause
to screen out the “1” rows and return only the duplicates. Finally, a
DELETE statement uses the IN operator with the second query to delete all the
duplicates.
As with any large scale DELETE, keep in mind that you may be
better off copying the rows you want to keep (i.e., those with ROW_NUMBER of 1)
to a new table. INSERTs generate a lot less overhead
than DELETEs.
Miss a tip?
Check out the Oracle archive, and catch up on our most recent Oracle tips.
Bob Watkins (OCP, MCDBA, MCSE, MCT) is a computer professional with 25 years of experience as a technical trainer, consultant, and database administrator. Visit Bob’s site.