How to find identical record pairs? - SQL Server Forums

Home | Weblogs | Forums

Site Sponsored By: SQLDSC - SQL Server Desired State Configuration

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

Transact-SQL (2000)

How to find identical record pairs?

Author

Topic

SQLTim
Starting Member

21 Posts

Posted - 2006-04-10 : 18:37:54

This has to do with a 2D matrix.
The underlying table has two records for each cell.

ItemID Attribute
3 XL
3 Red
4 L
4 Blue
5 L
5 Red
6 L
6 Blue

The last pair of records are a duplicate and should not exist. How would I write a select that would give me a list of the record pairs that are duplicated in this way?
The result would be two records:
ItemID
4
6

Any ideas?

Thanks in advance!!
Tim

khtan
In (Som, Ni, Yak)

17689 Posts

Posted - 2006-04-10 : 19:12:45

Is there a 3rd column to identify what is the Attribute (Color or Size) ?

KH

nathans
Aged Yak Warrior

938 Posts

Posted - 2006-04-10 : 19:27:40

right, AttributeTypeID would be useful.

Tim, take a look at this query and see if you can get at the info youre seeking.


declare @Matrix table (ItemID int, AttributeTypeID int, AttributeValue varchar(10))
insert into @Matrix
	select 3, 1, 'XL'	union
	select 3, 2, 'Red'	union
	select 4, 1, 'L'	union
	select 4, 2, 'Blue'	union
	select 5, 1, 'L'	union
	select 5, 2, 'Red'	union
	select 6, 1, 'L' 	union
	select 6, 2, 'Blue'


select 	ItemID,
	max(case
		when AttributeTypeID = 1 then AttributeValue
		else NULL
	end) as 'ItemSize',
	max(case
		when AttributeTypeID = 2 then AttributeValue
		else NULL
	end) as 'ItemColor'
from @Matrix
group by ItemID

Nathan Skerl

SQLTim
Starting Member

21 Posts

Posted - 2006-04-10 : 19:28:08

I tried to simplify this. The second column actually points to a template table that can have many combinations of attributes.
Basically, not all 'XL's are the same.
It is really the more than one pair issue I'm going after.

I did try a solution where (heavily pseudo coded):
select ItemID, (select top 1 attrib)+(select top 1 attrib desc)
from myTable
group by 1,2
having count(*)>1

I used correlated queries to do it, and the results were close, but not exact.
It seems like there is another way to do something like this.

I'll post the query here but I fear it will distract from my main goal:
select department,category,item, match
from (
Select department,category,item,
(select top 1 t.value_name
from i_attlnk a
join i_tmplat t on a.tmplat_id = t.tmplat_id
where i.invent_id = a.invent_id
-- look at the top of the list
and atname_id in
(select top 1 atname_id
from i_attlnk a
join i_tmplat t on a.tmplat_id = t.tmplat_id
where a.invent_id = i.invent_id
order by atname_id)
)+
(select top 1 t.value_name
from i_attlnk a
join i_tmplat t on a.tmplat_id = t.tmplat_id
where i.invent_id = a.invent_id
-- now look at the bottom of the list
and atname_id in
(select top 1 atname_id
from i_attlnk a
join i_tmplat t on a.tmplat_id = t.tmplat_id
where a.invent_id = i.invent_id
order by atname_id desc)
) as match
from i_items i
) as qq
group by department,category,item, match
having count(*)>1

Thanks for looking at this!!

pootle_flump

1064 Posts

Posted - 2006-04-10 : 19:44:14

Without attribute types:

DECLARE @Matrix TABLE (ItemID Int, AttributeValue VarChar(10))
INSERT INTO @Matrix
	SELECT 3, 'XL'		UNION
	SELECT 3, 'Red'		UNION
	SELECT 4, 'L'		UNION
	SELECT 4, 'Blue'	UNION
	SELECT 5, 'L'		UNION
	SELECT 5, 'Red'		UNION
	SELECT 6, 'L' 		UNION
	SELECT 6, 'Blue'


SELECT 	A.ItemID AS Dupes
FROM 	(SELECT ItemID, MAX(AttributeValue) AS AttribA, MIN(AttributeValue) AS AttribB
	FROM	@Matrix
	GROUP BY
		ItemID) A 
		INNER JOIN 
	(SELECT ItemID, MAX(AttributeValue) AS AttribA, MIN(AttributeValue) AS AttribB
	FROM	@Matrix
	GROUP BY
		ItemID) B ON
	A.AttribA = B.AttribA 
	AND A.AttribB = B.AttribB
WHERE	A.ItemID <> B.ItemID 


Dupes
-----
6
4

(2 row(s) affected)

Although your last post hints that this isn't what you are after afterall...

SQLTim
Starting Member

21 Posts

Posted - 2006-04-10 : 20:30:26

pootle_flump, your results would be precisely what I am looking for.

Will let you know if it works out.

SQLTim
Starting Member

21 Posts

Posted - 2006-04-10 : 20:43:54

OK, let me try a version closer to the "truth"
It is essentially the same as before:
PriKey AttrID ItemID
6706 6722 3354
6707 6721 3354
6708 6722 3355
6709 6718 3355
6710 6722 3356
6711 6718 3356
6712 6722 3357
6713 6720 3357
12956 7608 6481
12957 6722 6481

Only the second and third columns really matter here.
Can you see how 3355 and 3356 are the duplicate pairs of records?

The Query:
SELECT A.invent_id AS Dupes
FROM (SELECT invent_id, MAX(tmplat_id) AS AttribA, MIN(tmplat_id) AS AttribB
FROM i_attlnk GROUP BY invent_id) A
INNER JOIN
(SELECT invent_id, MAX(tmplat_id) AS AttribA, MIN(tmplat_id) AS AttribB
FROM i_attlnk GROUP BY invent_id) B ON
A.AttribA = B.AttribA AND A.AttribB = B.AttribB
join i_items i on i.invent_id=a.invent_id
WHERE A.invent_id <> B.invent_id

gave me 56 records. and not two. Ideas?

khtan
In (Som, Ni, Yak)

17689 Posts

Posted - 2006-04-10 : 22:01:11

Try this code

declare	@table table
(
	PriKey 	int,
	AttrID 	int,
	ItemID	int
)
declare	@DupAttr table
(
	AttrID	int
)
declare	@DupItem table
(
	ItemID	int
)


insert into @table
select  6706, 6722, 3354	union all
select  6707, 6721, 3354	union all

select  6708, 6722, 3355	union all
select  6709, 6718, 3355	union all
select  6710, 6722, 3356	union all
select  6711, 6718, 3356	union all

select  6712, 6722, 3357	union all
select  6713, 6720, 3357	union all
select 12956, 7608, 6481	union all
select 12957, 6722, 6481

insert into @DupAttr
	select	AttrID
	from	@table
	group	by AttrID
	having count(*) > 1

insert into @DupItem
	select	ItemID
	from	@table t inner join @DupAttr d
		on	t.AttrID	= d.AttrID
	group by ItemID
	having count(*) > 1

select	t.*
from	@table t inner join @DupAttr a
	on	t.AttrID	= a.AttrID
	inner join @DupItem i
	on	t.ItemID	= i.ItemID

KH

RyanRandall
Master Smack Fu Yak Hacker

1074 Posts

Posted - 2006-04-11 : 05:58:10

Here's another possible way (if you can accept the tiny chance that it will return an incorrect result)...

--data
declare	@table table
(
	PriKey 	int,
	AttrID 	int,
	ItemID	int
)

insert into @table
select  6706, 6722, 3354	union all
select  6707, 6721, 3354	union all

select  6708, 6722, 3355	union all
select  6709, 6718, 3355	union all
select  6710, 6722, 3356	union all
select  6711, 6718, 3356	union all

select  1, 6722, 3358		union all
select  1, 6718, 3358		union all
select  1, 7608, 3358		union all

select  6712, 6722, 3357	union all
select  6713, 6720, 3357	union all
select 12956, 7608, 6481	union all
select 12957, 6722, 6481

--calculation
declare @SummaryTable table (ItemId int, AggregatedAttrID float)
insert @SummaryTable select ItemId, checksum_agg(AttrID) + sum(log(AttrID + 0.5)) from @table group by ItemId

select * from @SummaryTable where AggregatedAttrID in (
    select AggregatedAttrID from @SummaryTable group by AggregatedAttrID having count(ItemId) > 1)

It would be interesting to figure out how tiny the chance of incorrect results really is, and interesting to then figure out a way to set up a combination of aggregation functions to minimise that chance. Something for another time, methinks...

Ryan Randall
www.monsoonmalabar.com London-based IT consultancy

Solutions are easy. Understanding the problem, now, that's the hard part.

SQLTim
Starting Member

21 Posts

Posted - 2006-04-11 : 09:44:38

It looks like it is close. Was hoping I could adapt it to my situ.
I'm going to need to expand what is going on here one more step.

I've changed some field names around a bit to accomodate.
There is an i_items table that holds one record for each item.
The i_attlnk table holds the matrix linking information.
So I'm looking for pair duplicates inside of any one item...in this case, any one Item_Id.

From previous examples:
AttrID is now tmplat_Id
ItemID is now invent_id
and Item_ID represents an item that has many colors and sizes.

So a record in the i_items table uses two records in the i_attlnk table to tell it what color and size it is.

I've discovered that some application is creating duplicate record pairs and need to reslove this quickly as many of our clients are doing their physical inventories now.

What follows is an example of two items that each have the problem of duplicate record pairs. I've adapted your approach and it works for this record set, but it does not work when I turn it loose on the whole data set.

I don't quite get what your approach is and would love to hear your thoughts re:
checksum_agg(tmplat_Id) + sum(log(tmplat_Id + 0.5)

It's facinating tact. I think if I understood it better I might be able to adapt it for this situ more correctly. Any ideas what I did wrong here?

--data

declare @i_items table
(
Item_Id int,
Invent_Id int
)

insert into @i_items
select 12454, 3354 union all
select 12454, 3355 union all
select 12454, 3356 union all
select 12454, 3357 union all
select 12454, 6481 union all
select 3549 , 1692 union all
select 3549 , 1693 union all
select 3549 , 1694 union all
select 3549 , 1695

declare @i_attlnk table
(
tmplat_Id int,
invent_Id int
)

insert into @i_attlnk
select 6722, 3354 union all
select 6721, 3354 union all
select 6722, 3355 union all
select 6718, 3355 union all
select 6722, 3356 union all
select 6718, 3356 union all
select 6722, 3357 union all
select 6720, 3357 union all
select 7608, 6481 union all
select 6722, 6481 union all
select 6588, 1692 union all
select 6585, 1692 union all
select 6588, 1693 union all
select 6586, 1693 union all
select 6588, 1694 union all
select 6587, 1694 union all
select 6588, 1695 union all
select 6585, 1695

--calculation
declare @SummaryTable table (invent_Id int, AggregatedAttrID float)
insert @SummaryTable select a.invent_Id, checksum_agg(tmplat_Id) + sum(log(tmplat_Id + 0.5))
from @i_attlnk a
join @i_items i on i.invent_Id = a.invent_Id
group by i.Item_Id,a.invent_Id

select * from @SummaryTable where AggregatedAttrID in (
select AggregatedAttrID from @SummaryTable group by AggregatedAttrID having count(invent_Id) > 1)

invent_Id AggregatedAttrID
----------- -----------------------------------------------------
1692 22.585706525424406
1695 22.585706525424406
3355 141.62583558245848
3356 141.62583558245848

(4 row(s) affected)

Thanks!

RyanRandall
Master Smack Fu Yak Hacker

1074 Posts

Posted - 2006-04-11 : 10:26:59

Hi Tim,

The idea of "checksum_agg(tmplat_Id) + sum(log(tmplat_Id + 0.5)" is just to quickly get some summarising number for each list of numbers. That's then obviously much easier to work with.

The ideal method is to come up with something which will never give the same answer for any 2 different lists, and this is just my attempt - off the top of my head. I feel like it's something someone will have investigated at some point, but couldn't find anything from a quick google search.

It's hard to help you without seeing a problem. Try to find or construct some data which breaks what you've got, and post that - it will then be much easier to help...

Ryan Randall
www.monsoonmalabar.com London-based IT consultancy

Solutions are easy. Understanding the problem, now, that's the hard part.

SQLTim
Starting Member

21 Posts

Posted - 2006-04-11 : 10:53:33

Hi Ryan,

The post just above your last one does what you ask.
It's a new data set that is pretty close to what I'm working with yet it keeps it simple.

I think your idea would work well here because it would be working on duplicates within each item, not over all items.
Because the set is small, the chances of random duplication is very small.

In the above example there are two items really: 3549 and 12545. 3549 has 4 sub items and 12545 has 5.
Each of those sub items represent a color size combination. Those are represented by two records in the i_attlnk table.
The problem is that I'm finding more than one sub item that has the same attribute pair.

I've modified your solution incorrectly to adapt to the new data set. I'm trying to figure out how I've gone astray.
I've bolded my major changes...

declare @SummaryTable table (Item_Id int, invent_Id int, AggregatedAttrID float)
insert @SummaryTable select i.Item_Id, a.invent_Id, checksum_agg(tmplat_Id) + sum(log(tmplat_Id + 0.5))
from @i_attlnk a
join @i_items i on i.invent_Id = a.invent_Id
group by i.Item_Id,a.invent_Id

select * from @SummaryTable where AggregatedAttrID in (
select AggregatedAttrID from @SummaryTable group by Item_Id, AggregatedAttrID having count(invent_Id) > 1)

yeilding:
Item_Id invent_Id AggregatedAttrID
----------- ----------- -----------------------------------------------------
3549 1692 22.585706525424406
3549 1695 22.585706525424406
12454 3355 141.62583558245848
12454 3356 141.62583558245848

Which is what I would be looking for. But it's still not quite right.
How would I handle:
count(invent_Id) > 1) ?

Thank you for your help in this!

Tim

SQLTim
Starting Member

21 Posts

Posted - 2006-04-14 : 14:50:56

I've got most of the answer!
Now I'm missing something I fear is basic.

--data

declare @i_items table
(
Item_Id int,
Invent_Id int
)

insert into @i_items
select 12454, 3354 union all
select 12454, 3355 union all
select 12454, 3356 union all
select 12454, 3357 union all
select 12454, 6481 union all
select 3549 , 1692 union all
select 3549 , 1693 union all
select 3549 , 1694 union all
select 3549 , 1695

declare @i_attlnk table
(
tmplat_Id int,
invent_Id int
)

insert into @i_attlnk
select 6722, 3354 union all
select 6721, 3354 union all
select 6722, 3355 union all
select 6718, 3355 union all
select 6722, 3356 union all
select 6718, 3356 union all
select 6722, 3357 union all
select 6720, 3357 union all
select 7608, 6481 union all
select 6722, 6481 union all
select 6588, 1692 union all
select 6585, 1692 union all
select 6588, 1693 union all
select 6586, 1693 union all
select 6588, 1694 union all
select 6587, 1694 union all
select 6588, 1695 union all
select 6585, 1695

Select Item_Id, AttName, count(*)
from (
select Item_Id, a.invent_Id, CAST(Max(tmplat_Id) as char(10) )+ CAST(Min(tmplat_Id) as char(10)) as AttName
from @i_attlnk a
join @i_items i on i.invent_Id = a.invent_Id
group by i.Item_Id,a.invent_Id
) as q
group by Item_Id, AttName
having count(*)>1

Item_Id     AttName              Count                  
----------- -------------------- ----------- 
3549        6588      6585       2
12454       6722      6718       2

This is great!
Now how would I get a listing of all of these items with their invent_id included?

Ideally I'm looking for:
Item_ID, invent_Id

that in this case should yeild 4 records.

Any ideas?

Many Thanks in advance!

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources