| Author |
Topic |
|
SqlZ
Yak Posting Veteran
69 Posts |
Posted - 2002-05-22 : 13:45:53
|
| This is the situation:Table A-----------------IDTable B-----------------ID1ID2I want to run a count and find the rows from Table A that are either in the ID1, or ID2 columns of Table B.I wrote the code below and it seems to take a lot of time to return so I am turning to you guys. All of the fields listed are indexed as well and each table consists of roughly 1.5 million records.SELECT Count(*) FROM AINNER JOIN B ON A.ID=B.ID1 OR A.ID=B.ID2Any way that you can think of doing this better would be greatly appreciated.Thanks,Anthony |
|
|
setbasedisthetruepath
Used SQL Salesman
992 Posts |
Posted - 2002-05-22 : 13:54:38
|
| The OR is likely degrading performance.try:select count(*)from tablea awhere exists ( select 1 from tableb where id1 = a.id ) or exists ( select 1 from tableb where id2 = a.id )setBasedIsTheTruepath<O> |
 |
|
|
MichaelP
Jedi Yak
2489 Posts |
Posted - 2002-05-22 : 13:56:20
|
| I think that's the fastest way to do it. Maybe you could try this way (although I think it will be slower)SELECT Count(*) FROM A WHERE a.ID IN (B.ID1, B.ID2)Michael |
 |
|
|
MichaelP
Jedi Yak
2489 Posts |
Posted - 2002-05-22 : 13:57:17
|
| Sniped by Set Based, and his answer looks pretty darn good. Try his first.Michael |
 |
|
|
M.E.
Aged Yak Warrior
539 Posts |
Posted - 2002-05-22 : 13:58:22
|
| Not sure If I'd even use a join for thisSelect A.ID, count(A.ID)from tablea a, Tableb bwhere a.id in (b.id1 ,b.id2)Bleh, Sniped by 2 people... how sad. Gotta learn to type faster (or research quicker ;).. Best method would be to test em allEdited by - M.E. on 05/22/2002 13:59:58 |
 |
|
|
MichaelP
Jedi Yak
2489 Posts |
Posted - 2002-05-22 : 14:10:03
|
| We didn't get to be Yak Masters for nothing!100% pure yak sniper :)Michael |
 |
|
|
SqlZ
Yak Posting Veteran
69 Posts |
Posted - 2002-05-22 : 16:17:41
|
| setbasedisthetruepath's code was much faster. Thanks :) |
 |
|
|
Page47
Master Smack Fu Yak Hacker
2878 Posts |
Posted - 2002-05-22 : 17:04:26
|
| Enter stage left: The notorious fribble to explain why the optimizer like SBTP's solution so much....<O> |
 |
|
|
Lavos
Posting Yak Master
200 Posts |
Posted - 2002-05-23 : 03:34:44
|
| Well, if you _really_ want to know, the answer lies in the fact that he said both columns are indexed.In SetBased's solution, it hits the indexes with a quick index seek to see if the value exists. In the others, it probably has to build a temporary result set, and then start running calculations. Check the execution plan to be sure.----------------------"O Theos mou! Echo ten labrida en te mou kephale!" |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2002-05-23 : 08:37:30
|
What I was mostly thinking is thatquote: SELECT Count(*) FROM A INNER JOIN B ON A.ID=B.ID1 OR A.ID=B.ID2
only gives the same result asquote: select count(*) from tablea a where exists ( select 1 from tableb where id1 = a.id ) or exists ( select 1 from tableb where id2 = a.id )
when A.ID may not occur in more than one row of B.Which seems like a peculiar constraint when there are two columns in which it can occur.(rest of this post is probably SQL Server 2000-specific)I don't really understand the OR join plan because it uses merge interval, which is weird!Another couple of ideas:SELECT COUNT(*)FROM AWHERE id IN (SELECT id1 FROM B UNION SELECT id2 FROM B) I get the same execution plan as Setty's EXIST. In both cases it's merge-concatenating the two columns of B. A smidgen faster (YMMV) wasSELECT COUNT(*)FROM AWHERE id IN (SELECT id1 FROM B) OR id IN (SELECT id2 FROM B) which just did two semijoins and a filter.SELECT COUNT(*)FROM AWHERE EXISTS (SELECT * FROM B WHERE id = id1 OR id = id2) sucks badly for my test data, loop joining.My test data consisted of 100000 rows in A -- integers 0..999999 and 100000 rows in B each consisting of two random integers in 0..999999 -- giving id1 and id2 63000-odd distinct values each and 100000 (usually) distinct pairs. Note that there is only a 37% (0.99999^100000) chance of rows in B with id1=id2, meaning that the first query mostly returns 200000!Edited by - Arnold Fribble on 05/23/2002 08:57:43 |
 |
|
|
Page47
Master Smack Fu Yak Hacker
2878 Posts |
Posted - 2002-05-23 : 08:41:10
|
I understand this has been asked and answered, but I am wondering if SqlZ can post the exact DDL used to create tablea and tableb as well as the indexes. Addionally, a sampling of the data would be nice to show relative selectivity. I am unable to reproduce the results.I have ...create table tablea (id int)create table tableb (id1 int, id2 int)create clustered index tablea_id on tablea(id)create index tableb_id1 on tableb(id1)create index tableb_id2 on tableb(id2)declare @int intset @int = 0while @int < 10000begin insert tablea values (@int) insert tableb values (@int, @int+1) set @int = @int + 2end I am trying the three suggestions here . . .--Sqlzselect count(*)from tablea a inner join tableb b on a.id = b.id1 or a.id = b.id2--MichaelP / M.E.select count(*)from tablea a inner join tableb b on a.id in (b.id1,b.id2)--SBTPselect count(*)from tablea awhere exists (select 1 from tableb where id1 = a.id) or exists (select 1 from tableb where id2 = a.id) My SQL 7 (sp3) optimizer is showing a subtree cost for the first two of 0.539 and for SBTP's query, I am seeing subtree cost of 1.77 . . . additionally, and this is kinda freaking me out(edit: explained by frib), I get different counts for MP/ME vs. SBTP . . . I think the cost difference is in the interaction with tableb. MP/ME does a table scan at a cost of .068 while the two index seek(s) used by SBTPs query agains tableb costs .875 and .789, which accounts for the majority of the query time.EDIT:when I run--fribbleselect count(*)from tablea awhere a.id in (select id1 from tableb union select id2 from tableb) given my schema and data, the optimizer chooses two index scans for tableb concatenated and hashed. This plan differs from SBTPs, and surprisingly gives the least expensive cost of any (.212).<O>Edited by - Page47 on 05/23/2002 08:50:49 |
 |
|
|
Page47
Master Smack Fu Yak Hacker
2878 Posts |
Posted - 2002-05-23 : 08:54:05
|
another option to throw into the mix . . .select count(*)from tablea a inner join ( select id1 as id from tableb union select id2 as id from tableb) b on a.id = b.id <O> |
 |
|
|
Page47
Master Smack Fu Yak Hacker
2878 Posts |
Posted - 2002-05-23 : 08:59:59
|
so SqlZ, which is it?tablea tableb------- -------ID = 47 ID1 = 47, ID2 = 47 would that count as two hits or one?<O> |
 |
|
|
SqlZ
Yak Posting Veteran
69 Posts |
Posted - 2002-06-04 : 10:33:20
|
| I was on vacation for a week so forgive my late response to this thread, and thanks for all of the replies and interest in this. You guys are pretty darn good. In my reply before I stated that SBTP's code ran much faster than my JOIN using OR. I didn't really try the rest of the suggestions because at the time I checked back I only saw SBTP's response. Page47, the question "so SqlZ, which is it? tablea tableb------- -------ID = 47 ID1 = 47, ID2 = 47 would that count as two hits or one?"Does not really apply in this case because every record in tableb either has a match in either ID1, OR ID2, and does not have a match in both columns on the same record as you documented above. I am sorry for not posting sample data and making that more clear. Again, thanks for all the replies! |
 |
|
|
|