Monday, May 2, 2016

Duplicates: Discover and Delete with T-SQL

Occasionally, you need to locate and delete exact duplicate records in a SQL Server table. Prior to the introduction of the Row_Number function the process required a somewhat complex process often using the GROUP BY and  HAVING commands along with temporary staging tables.  Microsoft support has one possible approach at the page:  How to remove duplicate rows from a table in SQL Server.

Fortunately, with Row_Number, we now have a more elegant approach.

Initial Discovery


The first process is to establish the correct discovery query, using Row_Number.

       
   select [Column_1],[Column_2],[Column_3],[Column_4], [Column_etc]
       ROW_NUMBER() over    
        ( partition by  [Column_1],[Column_2],[Column_3],[Column_4], [Column_etc]
              order by  [Column_1],[Column_2],[Column_3],[Column_4], [Column_etc]
         ) RowNumber
   from [dbo].[TargetTable];       
 

Here we are using the Row_Number function with both the partition by and order by commands.  All columns are listed, and listed three (3) times, in the same order.

       
   [Column_1],[Column_2],[Column_3],[Column_4], [Column_etc]
        

The Duplicate Record List


Modifying the above query, we can filter down to just the offending duplicates. Essentially, we are just reconfiguring the initial query as a sub-query and then adding a WHERE clause. You always want to test the query before deleting records, and this is a simple way to test your sub-query format.


       
   select *
   from (
    select [Column_1],[Column_2],[Column_3],[Column_4], [Column_etc]
        ROW_NUMBER() over    
         ( partition by  [Column_1],[Column_2],[Column_3],[Column_4], [Column_etc]
               order by  [Column_1],[Column_2],[Column_3],[Column_4], [Column_etc]
          ) RowNumber
    from [dbo].[TargetTable]
    ) as p
    where p.RowNumber > 1;       
 

Delete The offending Duplicates


Simply reconfigure your sub-query into a delete
       
   delete from p 
   from (
    select [Column_1],[Column_2],[Column_3],[Column_4], [Column_etc]
        ROW_NUMBER() over    
         ( partition by  [Column_1],[Column_2],[Column_3],[Column_4], [Column_etc]
               order by  [Column_1],[Column_2],[Column_3],[Column_4], [Column_etc]
          ) RowNumber
    from [dbo].[TargetTable]
    ) as p
    where RowNumber > 1 ;      
 

No comments: