Help with a Process - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

SQL Server Development (2000)

Help with a Process

Author

Topic

label
Posting Yak Master

197 Posts

Posted - 2003-01-23 : 10:22:41

I'm trying to delete all Duplicate rows in a table that has a Unique ID autocount field.

I've read the articles here on the sites but they dealt with deleting duplicates from Tables that did not have Unique ID's.

Currently, I've written a query that runs very slowly:

CREATE PROCEDURE dbo.ap_clean_duplicates AS

Begin

Create table #duprecords(mycount int, mydup varchar(1000))

insert into #duprecords
select count(modelname+configname+convert(varchar(20), requestdatetime)+convert(varchar(5), result)+[format]+version+username+interfacename), modelname+configname+convert(varchar(20), requestdatetime)+convert(varchar(5), result)+[format]+version+username+interfacename as myuid
from
smc_new_products.dbo.etechmodelrequests
group by
modelname+configname+convert(varchar(20), requestdatetime)+convert(varchar(5), result)+[format]+version+username+interfacename

Create table #dupes(req_id int)

Declare Cursid Cursor for
select distinct(mydup) from #duprecords where mycount>1

Declare @cid varchar(1000)

Open Cursid
Fetch Next from Cursid into @cid
while (@@fetch_status<>-1)

Begin

insert into #dupes
select top 1 requestID from
smc_new_products.dbo.etechmodelrequests
where
modelname+configname+convert(varchar(20), requestdatetime)+convert(varchar(5), result)+[format]+version+username+interfacename=@cid

Fetch Next from Cursid into @cid
End

delete from
smc_new_products.dbo.etechmodelrequests
where
requestID not in
(select req_id from #dupes)

drop table #duprecords
drop table #dupes
deallocate Cursid

End
GO

The table I'm trying to delete duplicates from has an AutoCount Primary Key Unique Id called "RequestId" and then about 12-15 other columns.

I appreciate any help anyone can offer and thank you in advance.

Robwhittaker
Yak Posting Veteran

85 Posts

Posted - 2003-01-23 : 10:41:44

This might not be the best way to do this, but you need a select that gives you a list of requestid to be deleted.

select a.requestid from table a, table b
where a.col = b.col
and a.col2 = b.col2
.
.
.
and a.requestid <> (select min(b.requestid) from table b)

then just delete this lot

delete from table
where requestid = (select a.requestid from table a, table b
where a.col = b.col
and a.col2 = b.col2
.
.
.
and a.requestid <> (select min(b.requestid) from table b)
)

You do need to worry about null fields, because if null=null it doesn't match so turn these in to "" or something.

Bewarned, just re-reading my SQL it's not totally correct when selecting the min requestid, needs to be for the same duplicates but that's why you run the inital select to make sure it gives ypu the records you want to delete, and don't for get to us a trans so you can always roll back

Edited by - robwhittaker on 01/23/2003 10:49:29

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2003-01-23 : 19:02:36

Try using GROUP BY to find duplicates. Much faster, easier to write, and handles Nulls just find.

DELETE FROM Table
WHERE ID Not In
(SELECT Min(ID) as ID FROM Table GROUP BY Field1, Field2, ....)

Be sure to group by all fields. The above will keep the min of each unique ID for all duplicates. Be sure to include all fields that need to unique in the GROUP BY clause (and exclude the ID, of course).

Run just the SELECT part to see what you are keeping; everything else will be deleted.

- Jeff

label
Posting Yak Master

197 Posts

Posted - 2003-01-24 : 08:53:10

quote:

Try using GROUP BY to find duplicates. Much faster, easier to write, and handles Nulls just find.

DELETE FROM Table
WHERE ID Not In
(SELECT Min(ID) as ID FROM Table GROUP BY Field1, Field2, ....)

Be sure to group by all fields. The above will keep the min of each unique ID for all duplicates. Be sure to include all fields that need to unique in the GROUP BY clause (and exclude the ID, of course).

Run just the SELECT part to see what you are keeping; everything else will be deleted.

- Jeff

I tried that and it pulled all the records......here's what I tried:

SELECT Min(RequestID) as [ID] FROM
Etechmodelrequests
GROUP BY
modelname,
configname,
intimestamp,
requestdatetime,
result,
serviceusername,
[format],
version,
username,
interfacename

The table I'm woring with is this:

CREATE TABLE [dbo].[EtechModelRequests] (
[RequestID] [int] IDENTITY (1, 1) NOT NULL ,
[modelname] [varchar] (256) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[configname] [varchar] (256) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[intimestamp] [varchar] (256) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[RequestDateTime] [datetime] NULL ,
[result] [int] NOT NULL ,
[serviceusername] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
[format] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[version] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[username] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[interfacename] [varchar] (256) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
) ON [PRIMARY]
GO

label
Posting Yak Master

197 Posts

Posted - 2003-01-24 : 09:04:28

Ok, I was wrong, I'm not getting the exact same number but it's certainly not picking up all the duplicates. I've run 3 different attempts at getting just a raw count and am coming up with 3 different numbers.

Below are the different queries I've tried:

Thanks again for any insight anyone can provide.

select count(*) from etechmodelrequests
Result: 457663

SELECT Min(RequestID) as [ID] FROM
Etechmodelrequests
GROUP BY
modelname,
configname,
intimestamp,
requestdatetime,
result,
serviceusername,
[format],
version,
username,
interfacename
Result: 451447

select count(distinct (modelname+configname+intimestamp+convert(varchar(20), requestdatetime)+convert(varchar(20), result)+serviceusername+[format]+version+username+interfacename)) from etechmodelrequests
Result: 449318

select count(modelname+configname+convert(varchar(20), requestdatetime)+convert(varchar(5), result)+[format]+version+username+interfacename), modelname+configname+convert(varchar(20), requestdatetime)+convert(varchar(5), result)+[format]+version+username+interfacename as myuid
from
smc_new_products.dbo.etechmodelrequests
group by
modelname+configname+convert(varchar(20), requestdatetime)+convert(varchar(5), result)+[format]+version+username+interfacename
Result: 397886

label
Posting Yak Master

197 Posts

Posted - 2003-01-24 : 14:53:07

Surely someone's got an idea of how to do this or why I'm getting 3 different results here?

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-01-24 : 15:10:07

You're getting 3 different results because you're running 3 different queries. I don't know what's so surprising about that. Remember that GROUP BY will create groups for each null column, but concatenating a null in a string makes the entire expression null, no matter how many columns may have been null. That's why grouping all of the columns individually has the highest row count. It also explains why the two different count expression have different results, a null expression is not counted.

Why doesn't Jeff's solution work for you?

label
Posting Yak Master

197 Posts

Posted - 2003-01-24 : 15:26:42

quote:

Remember that GROUP BY will create groups for each null column, but concatenating a null in a string makes the entire expression null, no matter how many columns may have been null.

So your saying the last query I used is inaccurate because any null in any column caused the entire expression to return null?

Fair enough....so then the problem with both of the last two queries was that I was concatonating nulls into the string which threw off my totals.

(If this is an inaccurate interpretation of what you said please correct me as I'm not terribly experienced in this area)

quote:

Why doesn't Jeff's solution work for you?

Well, it may work indeed. I was assuming that I had 50,000 duplicates in the database (based on the last query in the post above) and that only having 7,000 wasn't reflecting all of the actual duplicates. However, after reading your explanation the exact opposite seems to be true and the 7,000 is right and the 50,000 wrong.

Thanks for your insight and help.

Edited by - label on 01/24/2003 15:27:10

Edited by - label on 01/24/2003 15:27:32

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources