Easy Join Question For All You Gurus

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

SQL Server Development (2000)

Easy Join Question For All You Gurus

Author

Topic

SqlZ
Yak Posting Veteran

69 Posts

Posted - 2002-05-22 : 13:45:53

This is the situation:

Table A
-----------------
ID

Table B
-----------------
ID1
ID2

I want to run a count and find the rows from Table A that are either in the ID1, or ID2 columns of Table B.

I wrote the code below and it seems to take a lot of time to return so I am turning to you guys. All of the fields listed are indexed as well and each table consists of roughly 1.5 million records.

SELECT Count(*) FROM A
INNER JOIN B ON A.ID=B.ID1 OR A.ID=B.ID2

Any way that you can think of doing this better would be greatly appreciated.

Thanks,
Anthony

setbasedisthetruepath
Used SQL Salesman

992 Posts

Posted - 2002-05-22 : 13:54:38

The OR is likely degrading performance.

try:
select count(*)
from tablea a
where exists ( select 1 from tableb where id1 = a.id )
or exists ( select 1 from tableb where id2 = a.id )

setBasedIsTheTruepath
<O>

MichaelP
Jedi Yak

2489 Posts

Posted - 2002-05-22 : 13:56:20

I think that's the fastest way to do it.

Maybe you could try this way (although I think it will be slower)
SELECT Count(*) FROM A
WHERE a.ID IN (B.ID1, B.ID2)

Michael

MichaelP
Jedi Yak

2489 Posts

Posted - 2002-05-22 : 13:57:17

Sniped by Set Based, and his answer looks pretty darn good. Try his first.

Michael

M.E.
Aged Yak Warrior

539 Posts

Posted - 2002-05-22 : 13:58:22

Not sure If I'd even use a join for this

Select A.ID, count(A.ID)
from tablea a, Tableb b
where a.id in (b.id1 ,b.id2)

Bleh, Sniped by 2 people... how sad. Gotta learn to type faster (or research quicker ;).. Best method would be to test em all

Edited by - M.E. on 05/22/2002 13:59:58

MichaelP
Jedi Yak

2489 Posts

Posted - 2002-05-22 : 14:10:03

We didn't get to be Yak Masters for nothing!
100% pure yak sniper :)

Michael

SqlZ
Yak Posting Veteran

69 Posts

Posted - 2002-05-22 : 16:17:41

setbasedisthetruepath's code was much faster. Thanks :)

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2002-05-22 : 17:04:26

Enter stage left: The notorious fribble to explain why the optimizer like SBTP's solution so much....

<O>

Lavos
Posting Yak Master

200 Posts

Posted - 2002-05-23 : 03:34:44

Well, if you _really_ want to know, the answer lies in the fact that he said both columns are indexed.

In SetBased's solution, it hits the indexes with a quick index seek to see if the value exists. In the others, it probably has to build a temporary result set, and then start running calculations. Check the execution plan to be sure.

----------------------
"O Theos mou! Echo ten labrida en te mou kephale!"

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2002-05-23 : 08:37:30

What I was mostly thinking is that

quote:

SELECT Count(*) FROM A
INNER JOIN B ON A.ID=B.ID1 OR A.ID=B.ID2

only gives the same result as

quote:

select count(*)
from tablea a
where exists ( select 1 from tableb where id1 = a.id )
or exists ( select 1 from tableb where id2 = a.id )

when A.ID may not occur in more than one row of B.
Which seems like a peculiar constraint when there are two columns in which it can occur.

(rest of this post is probably SQL Server 2000-specific)
I don't really understand the OR join plan because it uses merge interval, which is weird!

Another couple of ideas:


SELECT COUNT(*)
FROM A
WHERE id IN (SELECT id1 FROM B UNION SELECT id2 FROM B)

I get the same execution plan as Setty's EXIST. In both cases it's merge-concatenating the two columns of B. A smidgen faster (YMMV) was


SELECT COUNT(*)
FROM A
WHERE id IN (SELECT id1 FROM B)
   OR id IN (SELECT id2 FROM B)

which just did two semijoins and a filter.


SELECT COUNT(*)
FROM A
WHERE EXISTS (SELECT * FROM B WHERE id = id1 OR id = id2)

sucks badly for my test data, loop joining.

My test data consisted of 100000 rows in A -- integers 0..999999 and 100000 rows in B each consisting of two random integers in 0..999999 -- giving id1 and id2 63000-odd distinct values each and 100000 (usually) distinct pairs. Note that there is only a 37% (0.99999^100000) chance of rows in B with id1=id2, meaning that the first query mostly returns 200000!

Edited by - Arnold Fribble on 05/23/2002 08:57:43

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2002-05-23 : 08:41:10

I understand this has been asked and answered, but I am wondering if SqlZ can post the exact DDL used to create tablea and tableb as well as the indexes. Addionally, a sampling of the data would be nice to show relative selectivity. I am unable to reproduce the results.

I have ...


create table tablea (id int)
create table tableb (id1 int, id2 int)
create clustered index tablea_id on tablea(id)
create index tableb_id1 on tableb(id1)
create index tableb_id2 on tableb(id2)

declare @int int
set @int = 0

while @int < 10000
begin
    insert tablea values (@int)
    insert tableb values (@int, @int+1)
    set @int = @int + 2
end

I am trying the three suggestions here . . .


--Sqlz
select
    count(*)
from
    tablea a
    inner join tableb b
        on a.id = b.id1 or a.id = b.id2

--MichaelP / M.E.
select
    count(*)
from
    tablea a
    inner join tableb b
        on a.id in (b.id1,b.id2)

--SBTP
select
    count(*)
from
    tablea a
where
    exists (select 1 from tableb where id1 = a.id) or
    exists (select 1 from tableb where id2 = a.id)

My SQL 7 (sp3) optimizer is showing a subtree cost for the first two of 0.539 and for SBTP's query, I am seeing subtree cost of 1.77 . . . additionally, and this is kinda freaking me out(edit: explained by frib), I get different counts for MP/ME vs. SBTP . . . I think the cost difference is in the interaction with tableb. MP/ME does a table scan at a cost of .068 while the two index seek(s) used by SBTPs query agains tableb costs .875 and .789, which accounts for the majority of the query time.

EDIT:
when I run


--fribble
select
    count(*)
from
    tablea a
where
    a.id  in (select id1 from tableb union select id2 from tableb)

given my schema and data, the optimizer chooses two index scans for tableb concatenated and hashed. This plan differs from SBTPs, and surprisingly gives the least expensive cost of any (.212).

<O>

Edited by - Page47 on 05/23/2002 08:50:49

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2002-05-23 : 08:54:05

another option to throw into the mix . . .


select
    count(*)
from
    tablea a
    inner join (
        select
            id1 as id
        from
            tableb
        union
        select
            id2 as id
        from
            tableb) b
        on a.id = b.id

<O>

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2002-05-23 : 08:59:59

so SqlZ, which is it?


tablea               tableb
-------              -------
ID = 47              ID1 = 47, ID2 = 47

would that count as two hits or one?

<O>

SqlZ
Yak Posting Veteran

69 Posts

Posted - 2002-06-04 : 10:33:20

I was on vacation for a week so forgive my late response to this thread, and thanks for all of the replies and interest in this. You guys are pretty darn good. In my reply before I stated that SBTP's code ran much faster than my JOIN using OR. I didn't really try the rest of the suggestions because at the time I checked back I only saw SBTP's response.

Page47, the question
"so SqlZ, which is it?

tablea tableb
------- -------
ID = 47 ID1 = 47, ID2 = 47

would that count as two hits or one?"

Does not really apply in this case because every record in tableb either has a match in either ID1, OR ID2, and does not have a match in both columns on the same record as you documented above. I am sorry for not posting sample data and making that more clear.

Again, thanks for all the replies!

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources