Author |
Topic |
shaggy
Posting Yak Master
248 Posts |
Posted - 2015-04-22 : 09:33:56
|
Hi friends,There is one report to identify potential duplicate in a table and it is performing poor.I'm now tuning the existing SP and got struck in modifiying it. Can anyone pls guide me to rewrite the query in a best way. I just pasted below an example of query which is now in a report.The report will be run every week currently the table has 10 million records, and every week there will 5k to 10k will be added up so with that 5k to 10 k we have to check all the 10 miilion rows that if it is duplciated the logic is (surname = surmane or forename = forename or DOB =DOB )Create table #employee(ID int,empid varchar(100),surname varchar(100),forename varchar(100),DOB datetime,empregistereddate datetime,Createdate datetime) Insert #employee select 1,'EMP1','Joe','Bonnie','1980-04-22 18:24:39.397','2014-08-11 18:33:00.920',getdate() union select 2,'EMP2','Joe','Corey','1984-04-12 18:24:39.397','2014-07-12 18:33:00.920',getdate() union select 3,'EMP3','Chase','Corner','1985-01-01 18:24:39.397','2014-06-13 18:33:00.920',getdate() union select 4,'EMP4','Yang','Corner','1980-03-25 18:24:39.397','2014-05-14 18:33:00.920',getdate() union select 5,'EMP5','McDowell','Julia','1990-08-22 18:24:39.397','2014-04-15 18:33:00.920',getdate() union select 6,'EMP6','Norton','Julia','1991-01-22 18:24:39.397','2014-03-16 18:33:00.920',getdate() union select 7,'EMP7','Mayo','Julia','2015-04-22 18:24:39.397','2014-02-17 18:33:00.920',getdate() union select 8,'EMP8','David','Allen','2015-04-22 18:24:39.397','2014-01-18 18:33:00.920',getdate() union select 9,'EMP9','David','Bonnie','1991-01-22 18:24:39.397','2015-01-01 18:33:00.920',getdate() union select 10,'EMP10','Monroe','Bonnie','1991-01-22 18:24:39.397','2015-04-22 18:33:00.920',getdate() select a.ID,DUP.id DUPid,a.empid,DUP.empid DUPempid,a.surname,DUP.surname DUPsurname,a.forename,DUP.forename DUPforename,a.DOB,DUP.DOB DUPdob,a.empregistereddate,DUP.empregistereddate DUPempregistereddate,a.Createdate,DUP.Createdate DUPCreatedateFROM #employee a INNER JOIN #employee DUP ON (a.empregistereddate between '2015-01-01 00:33:00.920' and '2015-05-01 18:33:00.920') AND ( (a.SURNAME = DUP.SURNAME ) OR (a.FORENAME = DUP.FORENAME ) OR a.DOB = DUP.DOB ) WHERE a.id <> DUP.id |
|
gbritton
Master Smack Fu Yak Hacker
2780 Posts |
Posted - 2015-04-22 : 10:31:44
|
Are the join and where predicates operating on indexed columns? |
|
|
Kristen
Test
22859 Posts |
Posted - 2015-04-22 : 11:21:25
|
I suspect that it is the OR (in the JOIN) which is making it slow.It might be faster (assuming columns in the JOIN predicates are indexed) to have three separate queries and UNION them together.select a.ID,DUP.id DUPid,a.empid,DUP.empid DUPempid,a.surname,DUP.surname DUPsurname, a.forename,DUP.forename DUPforename,a.DOB,DUP.DOB DUPdob,a.empregistereddate,DUP.empregistereddate DUPempregistereddate, a.Createdate,DUP.Createdate DUPCreatedateFROM ( SELECT [A_ID] = A.ID, [DUP_ID] = DUP.ID FROM #employee AS A JOIN #employee AS DUP ON DUP.id <> A.id AND DUP.SURNAME = a.SURNAME WHERE A.empregistereddate between '2015-01-01 00:33:00.920' and '2015-05-01 18:33:00.920' UNION SELECT [A_ID] = A.ID, [DUP_ID] = DUP.ID FROM #employee AS A JOIN #employee AS DUP ON DUP.id <> A.id AND a.FORENAME = DUP.FORENAME WHERE A.empregistereddate between '2015-01-01 00:33:00.920' and '2015-05-01 18:33:00.920' UNION SELECT [A_ID] = A.ID, [DUP_ID] = DUP.ID FROM #employee AS A JOIN #employee AS DUP ON DUP.id <> A.id AND a.DOB = DUP.DOB WHERE A.empregistereddate between '2015-01-01 00:33:00.920' and '2015-05-01 18:33:00.920') AS X JOIN #employee AS A ON A.id = X.A_ID JOIN #employee AS DUP ON DUP.id = X.DUP_ID On the tests I have run, on your sample data but with appropriate indexes, your code runs better than mine ... but on 10M rows it might be different.I would be interested in the Logical Reads / Scans - e.g. from this:SET STATISTICS IO ON... your actual query here ...SET STATISTICS IO OFF |
|
|
Kristen
Test
22859 Posts |
Posted - 2015-04-22 : 11:39:53
|
Yeah, I added 8,000 rows to the #employee (old with empregistereddate before the test period) and my query's logical reads was the same, whereas your query's jumped massively ... |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2015-04-22 : 14:44:59
|
[code]-- SwePesoSELECT a.ID, dup.ID AS DUPid, a.EmpID, DUP.EmpID AS DUPempid, a.Surname, dup.Surname AS DUPsurname, a.Forename, dup.Forename AS DUPforename, a.DOB, dup.DOB AS DUPdob, a.EmpRegisteredDate, dup.EmpRegisteredDate AS DUPempregistereddate, a.CreateDate, dup.CreateDate AS DUPCreatedateFROM #Employee AS aINNER JOIN #Employee AS dup ON dup.ID > a.IDWHERE a.EmpRegisteredDate BETWEEN '20150101 00:33:00.920' AND '20150501 18:33:00.920' AND (a.Surname = dup.Surname OR a.Forename = dup.Forename OR a.DOB = dup.DOB);[/code]Make sure there is an clustered index on ID column. Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA |
|
|
shaggy
Posting Yak Master
248 Posts |
Posted - 2015-04-23 : 01:08:25
|
Sorry for delay in reply.gbritton -Yes the Surname,forename,DOB,ID all are indexed seperately.Kristen - Yeh the OR clause is the culprit,I have tried your query and it got executed in 2 mins, where the original query will run for half a day.SwePeso - There is no clustered index on this table and your query is crossing the threshold duration. |
|
|
Kristen
Test
22859 Posts |
Posted - 2015-04-23 : 03:00:01
|
quote: Originally posted by shaggy There is no clustered index on this table
You ought to have a clustered index on every table (it helps with some housekeeping tasks). Doesn't have to be the Primary Key, although that is usually the best candidate e.g. for tables involved in JOINs |
|
|
shaggy
Posting Yak Master
248 Posts |
Posted - 2015-04-23 : 09:37:08
|
Thanks Kristen,I understand that but now we cannot able to change it overnight but thanks for pointing it. |
|
|
|
|
|