sherrer
64 Posts
Posted - 2002-11-05 : 09:30:37
We have a table with 40 million rows in SQL Server 7.0; its size is over 3GB. It is loaded from a text file using BCP. We are only loading a few columns of data from the text file, causing many rows to be duplicated. If anyone has any ideas on the fastest way to remove the duplicates, I would appreciate some feedback. Is there a way to eliminate them with DTS or BCP during the import? We have tried moving the data to a new table using a SELECT DISTINCT *. This works, but I don't know if there is a faster way.

Thanks...
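For reference, a minimal sketch of the SELECT DISTINCT approach mentioned above, assuming a hypothetical table named t1:

-- Copy only the unique rows into a new table. In SQL 7.0,
-- SELECT ... INTO is minimally logged when the database's
-- 'select into/bulkcopy' option is enabled.
SELECT DISTINCT *
INTO t1_clean
FROM t1

The new table then has to be swapped in for the original, so the database briefly holds both copies of the data.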
nr
SQLTeam MVY
12543 Posts
Posted - 2002-11-05 : 10:16:11
You could try creating an index with IGNORE_DUP_KEY.

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.
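A sketch of that suggestion, assuming hypothetical table and column names; the unique index rejects rows whose key already exists instead of failing the whole batch:

-- SQL 7.0 syntax: IGNORE_DUP_KEY only applies to UNIQUE indexes.
CREATE UNIQUE INDEX ix_t1_dedup
    ON t1 (col1, col2, col3)   -- list every column that defines a duplicate
    WITH IGNORE_DUP_KEY

-- Subsequent inserts of duplicate keys raise the warning
-- "Duplicate key was ignored." and discard the row rather than
-- rolling back the insert.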
sherrer
64 Posts
Posted - 2002-11-05 : 12:01:52
Thanks. For some reason I thought BCP would produce an error if a key violation was encountered, but after testing it, the test data loaded fine. Does anyone have comments on the performance issues? Will building a unique index on 15 columns during the insert of this data be faster than doing the required table scans to eliminate the duplicates after the fact?

Thanks
1fred
Posting Yak Master
158 Posts
Posted - 2002-11-05 : 12:14:25
I can't really see a faster way: since your data is in a file, you can't run a SELECT against the file. In a DTS package you can bulk insert each file into a temp table and then run a SELECT DISTINCT in a SQL task to insert into your real table.
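A minimal sketch of that staging approach; the table names, file path, and terminators below are placeholders:

-- 1. Load the raw file into a staging table.
BULK INSERT t1_staging
    FROM 'c:\data\monthly_load.txt'
    WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n')

-- 2. Move only the distinct rows into the real table.
INSERT INTO t1 (col1, col2, col3)
SELECT DISTINCT col1, col2, col3
FROM t1_staging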
sherrer
64 Posts
Posted - 2002-11-06 : 11:47:12
OK, after playing around some... We built a unique index on all the fields of the table the data is inserted into. We had to set the ignore-duplicates option, or an error was raised and the entire insert was rolled back.

After loading all the data, we noticed that SELECT COUNT(*) FROM t1 returned a different value than the number of rows returned by SELECT * FROM t1. It seems that all the data was inserted into the table (including the duplicate rows), but entries were not made in the index for all the rows. The COUNT query was doing a table scan while the SELECT * was using the index. According to BOL this should not happen; it says the duplicate row will not be inserted (which is what we want), but that is not the case.

The second option we tried was to SELECT DISTINCT the rows into another table, which we knew would work fine, but it basically doubles the size of the database. Although it's temporary, we are talking about 3 to 5 GB of data in the database and transaction log. We load the data once a month. The data goes into a temporary table in the database and is swapped with the production table, with the old production table being dropped after the new table has been renamed. In other words, the database grows to over 20GB when we are only trying to store 3GB of actual data.

Currently we are testing a script that will go through and delete the duplicates. We feel that this will probably be our most efficient way.

If anyone out there has experience with moving around large chunks of data at a time, I would love to know if there are any tricks that can be used to 1) save time and 2) keep the database to a minimal size.

Thanks
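One way to write such a script on 7.0, which has no ROW_NUMBER(), is to add a temporary identity column so that otherwise-identical rows can be told apart. A sketch, with col1/col2 standing in for the full 15-column key:

-- Give every row a unique id so duplicate rows become distinguishable.
ALTER TABLE t1 ADD rowid int IDENTITY(1,1)

-- Keep the lowest rowid in each duplicate set, delete the rest.
DELETE FROM t1
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM t1
    GROUP BY col1, col2   -- all 15 key columns go here
)

-- Drop the helper column afterwards.
ALTER TABLE t1 DROP COLUMN rowid

On 40 million rows you would probably want to batch the DELETE (for example with SET ROWCOUNT and a loop) so the transaction log stays small.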
nr
SQLTeam MVY
12543 Posts
Posted - 2002-11-06 : 12:39:53
That doesn't happen when I try it with straight inserts. Sounds like a bug associated with bcp.

You could try making it a clustered index, but that would slow the bcp down a lot. Would be interesting to know the result.

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.
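The clustered variant would look like this (hypothetical names again); because a clustered index physically orders the table, the load pays the sort cost up front:

CREATE UNIQUE CLUSTERED INDEX ix_t1_dedup
    ON t1 (col1, col2, col3)   -- the full duplicate-defining key
    WITH IGNORE_DUP_KEY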