Remove duplicates from table using SQL Rank() function

Having to de-duplicate rows from source tables is a very common task in the data integration world, we were trying to do all different of Tsql to filter out the duplicate values. We needed to pick the row based upon a certain criteria. On the internet i found the TSQL function Rank()

It works like this:

No matter where the source data is stored, as a part of the ETL workflow, we will stage source data in a SQL Server 2005 table.

This is the staging table:

create table SourceWithDuplicates
(SurrogateKey int identity(1,1),
Code varchar(12),
Description varchar(20),
OtherAttribute1 int,
OtherAttribute2 Varchar(10),
TieBreakerColumn Varchar(12))

As a general rule I always add a surrogate key column to my staging tables (that makes some DBAs happy); an in this case it would actually help in the de-duplication process.

Some sample data:

Insert into SourceWithDuplicates values(‘Abc’, ‘Row Description 1’, 1, ‘Other1′,’1’)
Insert into SourceWithDuplicates values(‘Abc’, ‘Row Description 1’, 1, ‘Other1’, ‘1’)
Insert into SourceWithDuplicates values(‘Abc’, ‘Row Description 1’, 1, ‘Other2′,’2’)
Insert into SourceWithDuplicates values(‘Def’, ‘Row Description 2’, 1, ‘Other4′,’a’)
Insert into SourceWithDuplicates values(‘Def’, ‘Row Description 2’, 1, ‘Other5′,’a’)
Insert into SourceWithDuplicates values(‘Ghi’, ‘Row Description 3’, 1, ‘Other5′,’a’)
Insert into SourceWithDuplicates values(‘Jkl’, ‘Row Description 4’, 1, ‘Other5′,’a’)

The requirements:

Code and description values need to be unique in the target table; in case they are duplicates, choose the row with the highest value in the tie breaker column. If the value in the tiebreaker column is also duplicated; then choose any of the rows.

So the data looks like:

SurrogateKey Code Description OtherAttribute1 OtherAttribute2 TieBreaker ———— ———— ——————– ————— ————— ———-
1 Abc Row Description 1 1 Other1 1
2 Abc Row Description 1 1 Other1 1
3 Abc Row Description 1 1 Other2 2
4 Def Row Description 2 1 Other4 a
5 Def Row Description 2 1 Other5 a
6 Ghi Row Description 3 1 Other5 a
7 Jkl Row Description 4 1 Other5 a

Following the requirements, only 4 rows kept should be chosen; those with surrogate keys:2 3, 4 or 5, 6 and 7
This is the query using the Rank() function that resolve our issue:

Select *,
rank() over (Partition by Code, description order by TieBreakerColumn desc, SurrogateKey) MyRank
from SourceWithDuplicates

and this is its output:

Surrogatekey Code Description MyRank
———— ———— ——————– ——————–
3 Abc Row Description 1 1
1 Abc Row Description 1 2
2 Abc Row Description 1 3
4 Def Row Description 2 1
5 Def Row Description 2 2
6 Ghi Row Description 3 1
7 Jkl Row Description 4 1

Notice that if we filter the result from the previous query using WHERE MyRank=1 we will get the rows to be kept.

Very nice ! I  found this at:

http://rafael-salas.blogspot.com/2007/04/remove-duplicates-using-t-sql-rank.html

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.