sherrer
64 Posts
Posted - 2002-12-09 : 13:47:50
I am using SQL 7.0. I have a 5 GB table (45,000,000 rows) that has duplicate rows. Currently, a sequential column has been added to the table as the primary key. There is only a small amount of duplication (ignoring the primary key), and those rows need to be eliminated. We have a scheme in place that uses a GROUP BY clause to find the duplicates and the max id. We use that information to join back to the original table and delete the duplicates, joining on all the columns and excluding the id returned as the max. It works, but it is extremely time consuming and uses a fairly large amount of resources. We don't want to SELECT DISTINCT into another table, since that would move most of the data. I have posted this problem in the past, but would like to find out if anyone has any new insight. Thanks...
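[For reference, a minimal sketch of the GROUP BY / MAX(id) delete pattern described above; myTable, col1 and col2 are placeholders, not the poster's actual schema, and Seq stands in for the sequential primary key.]

-- keep the row with the highest Seq in each group of identical rows,
-- and delete the rest
DELETE t
FROM myTable t
JOIN (SELECT col1, col2, MAX(Seq) AS MaxSeq
      FROM myTable
      GROUP BY col1, col2
      HAVING COUNT(*) > 1) d
  ON t.col1 = d.col1
 AND t.col2 = d.col2
 AND t.Seq < d.MaxSeq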
robvolk
Most Valuable Yak
15732 Posts
Posted - 2002-12-09 : 14:03:31
Is it possible to add a column to the table, like:

ALTER TABLE myTable ADD DupeFlag bit DEFAULT (0)

You can then use an UPDATE statement to flag the duplicate rows, and you can break up the operation so that it only has to process a small(er) set of rows instead of the entire table. Once the UPDATE is done, you can do DELETE myTable WHERE DupeFlag=1, and you can also break that up into smaller operations. It might not be faster ultimately, but it will use fewer resources and won't fill the transaction log as quickly.
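[A rough sketch of this flag-then-delete idea, batched with SET ROWCOUNT; myTable, col1 and col2 are placeholders, and the batch size of 10000 is arbitrary.]

ALTER TABLE myTable ADD DupeFlag bit DEFAULT (0)

-- flag every row except the one with the highest Seq in each duplicate group
UPDATE t
SET DupeFlag = 1
FROM myTable t
JOIN (SELECT col1, col2, MAX(Seq) AS MaxSeq
      FROM myTable
      GROUP BY col1, col2
      HAVING COUNT(*) > 1) d
  ON t.col1 = d.col1 AND t.col2 = d.col2 AND t.Seq < d.MaxSeq

-- delete the flagged rows in small batches so the transaction log stays manageable
SET ROWCOUNT 10000
WHILE 1 = 1
BEGIN
    DELETE myTable WHERE DupeFlag = 1
    IF @@ROWCOUNT = 0 BREAK
END
SET ROWCOUNT 0

[The UPDATE could likewise be restricted to ranges of Seq values so each pass touches only part of the table.]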
jsmith8858
Dr. Cross Join
7423 Posts
Posted - 2002-12-09 : 14:45:48
Out of curiosity -- what kind of data is in this table, and what are the fields? How do dups end up in there? This must not be a one-time cleanup, since you are concerned with speed.

- Jeff
sherrer
64 Posts
Posted - 2002-12-09 : 15:16:15
Rob: I do agree that your suggestion will keep the transaction log smaller and produce less disk I/O. Ultimately we are looking for speed, as we expect to run this during low-use times; we don't see how your suggestion makes the process faster, but I should have conveyed that goal better in the original posting. Let me know if I am wrong, but with your idea I would still have to do the GROUP BY to determine the duplicates, then do an inner-join UPDATE using all the columns to join, flagging all but one row from each group (so one row per group remains), and then delete.

jsmith: We get data from the post office monthly to certify mailing addresses. It comes in many text files and contains much more information than we are interested in using. Since we do not want to write all of the unwanted information into the database from the text files, we only pull in what we want, which causes duplication on some rows. We tried to create an index beforehand to eliminate the duplicates as we imported, but it slowed the process down tremendously and created errors in the DTS package we were using. The errors can be handled, but we don't feel that is a clean way to do it.

Almost all of the data is imported as text; see the structure below. There are two tables, related by CityStateCode. The Seq column is there only to give uniqueness to the rows, since we expect duplicates.

CREATE TABLE [dbo].[tmp_CityState] (
    [Seq] [int] NOT NULL ,
    [CityStateCode] [varchar] (6) NULL ,
    [City] [varchar] (28) NULL ,
    [State] [varchar] (2) NULL ,
    [County] [varchar] (25) NULL
) ON [PRIMARY]

CREATE TABLE [dbo].[tmp_Zip4] (
    [Seq] [int] NOT NULL ,
    [CityStateCode] [varchar] (6) NULL ,
    [AddressCode] [varchar] (1) NULL ,
    [OddEvenCode] [varchar] (1) NULL ,
    [Zip5] [varchar] (5) NULL ,
    [Zip4High] [varchar] (4) NULL ,
    [Zip4Low] [varchar] (4) NULL ,
    [CarrierRoute] [varchar] (4) NULL ,
    [StreetNoHigh] [varchar] (10) NULL ,
    [StreetNoLow] [varchar] (10) NULL ,
    [StreetNoHighAlt] [varchar] (10) NULL ,
    [StreetNoLowAlt] [varchar] (10) NULL ,
    [StreetNmPreDir] [varchar] (2) NULL ,
    [StreetNm] [varchar] (28) NULL ,
    [StreetNmSuffix] [varchar] (4) NULL ,
    [StreetNmPostDir] [varchar] (2) NULL ,
    [CompanyNm] [varchar] (40) NULL
) ON [PRIMARY]

Thanks for the help
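[For what it's worth, against the tmp_CityState table above the duplicate-detection step might look roughly like this sketch; keeping MAX(Seq) is an arbitrary choice.]

SELECT CityStateCode, City, State, County,
       COUNT(*) AS Copies,
       MAX(Seq) AS KeepSeq
FROM tmp_CityState
GROUP BY CityStateCode, City, State, County
HAVING COUNT(*) > 1

[One caveat: GROUP BY treats NULLs as equal, but an equality join back to the table will not match NULL values, so nullable columns such as County may need ISNULL() wrappers in the join.]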
AndrewMurphy
Master Smack Fu Yak Hacker
2916 Posts
Posted - 2002-12-10 : 11:08:46
The key to working with this sort of data is to establish some sort of vetting process and store mapping info about how and what data has been cleaned:

"rawdata + mappingtable + qa-process => qualitydata + pendingdata + rejecteddata"

Then, after the first load, you only need to track changes in the new source of raw data. 98%+ of your addresses won't change (materially), so there's no need to re-process them.

Say you have 1000 addresses in raw data. Using some automated process you identify 800 unique (and usable) addresses, 150 duplicate addresses (paired), and 50 addresses that don't meet your (automated) quality standards. Then create links between the 800 records and the individual raw addresses they came from; create 75 unique records with pairings back to each source pair; and create 50 records linked to the 50 raw addresses, queued for inspection by your administrators.

On rejection by an admin, tag them appropriately; next time they can be ignored, unless they improve "by magic" in the post-office system. (Maybe put a time limit on them being ignored, so they can be re-inspected after a long enough interval that they may have improved.) On acceptance/cleanup by an admin, tag them appropriately as cleaned data, so next time they don't need to be cleaned.

It's a virtuous cycle, just like tidying up your toys: clean up today, and while tomorrow you'll have another cleanup to do, it will be a smaller problem. The main difficulty is making the initial cleanup effort manageable enough to look surmountable.
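[One hypothetical shape for the mapping structures described above; every name here is illustrative, nothing from the actual system.]

CREATE TABLE CleanAddress (
    CleanId       int NOT NULL PRIMARY KEY,
    CityStateCode varchar(6) NULL,
    City          varchar(28) NULL,
    State         varchar(2) NULL,
    Status        char(1) NOT NULL,   -- 'A' accepted, 'P' pending review, 'R' rejected
    StatusDate    datetime NOT NULL   -- lets rejected rows be re-inspected after an interval
)

CREATE TABLE RawCleanMap (
    RawSeq  int NOT NULL,   -- Seq of the raw imported row
    CleanId int NOT NULL    -- the cleaned/unique record it was matched to
)

[On each monthly load, raw rows that already map to an accepted CleanAddress could be skipped, so only new or changed addresses go through the QA step.]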
sherrer
64 Posts
Posted - 2002-12-10 : 12:39:09
Thanks Andrew, this seems like a good idea. We will still have to load the data from the text files into temp tables to work with it, but it seems it may be easier to manage inserts, updates, and deletes between the two sets of data than to try to find and eliminate duplicates. I do know that less data will be manipulated with this approach.

Thanks, Kevin