Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 Record linkage - Advice required.

Author  Topic 

Abu-Dina
Posting Yak Master

206 Posts

Posted - 2012-07-30 : 14:22:09
Hi,

I have the following table structure:

create table dbo.PreMerge (
master_id int,
Title varchar(20),
Forename varchar(50),
Surname varchar(50),
CompanyName varchar(200),
CompanyName2 varchar(200),
Address1 varchar(200),
Address2 varchar(200),
Address3 varchar(200),
Address4 varchar(200),
Town varchar(100),
County varchar(100),
Postcode varchar(20),
Hierarchy int
)

insert into PreMerge (master_id, Title, Forename, Surname, CompanyName, Address1, Address2, Town, Postcode, Hierarchy)
select 1547300, 'Mr', 'E', 'Glowver', NULL, '21 Harcourt Crescent', 'Nuthall', 'Nottingham', 'NG14 6BT', '20100309' union all
select 1581378, 'Mr', 'C', 'Wallis', NULL, '116 Lynch Gate', NULL, 'Lydbury North', 'SY7 8AE', '20100309' union all
select 43196681, 'Mrs', 'X', 'Collows', NULL, '14 Ferndown Road', 'Eltham', 'London', 'SE4 5JP', '20120501' union all
select 43198564, 'Mrs', 'Z', 'Brogdenson', NULL, '36 Rhodes Park', NULL, 'North Berwick', 'EH11 5QA', '20120501' union all
select 1581378, 'Mr', 'Y', 'Walls', NULL, '117 Lynch Gate', NULL, 'Lydbury South', 'SX7 8OE', '20100309' union all
select 1547300, 'Mr', 'P', 'Glowver', NULL, '21 Harcourt', 'Nuthall', 'Notts', 'NG14 6BT', '20100309'



This is a very small sample of the data that I'm going to be working with but the table columns will always be the same.

Assumptions:

1) Data hygiene will be performed by a separate process.
2) Addresses are all valid - No record will exist without a UK postcode

The method that had been previously developed as follows:

The PreMerge table is joined to itself with the left table A being the table with record we want to retain and the right table B will be the record we will want to remove as duplicate

so it will be something like this

select a.master_id as retained_id, b.master_id as Duplicate_id,
a.title as retained_title, b.title as dropped_title
a.forename as retained_forename, b.forename as dropped_forename
.
.
.
.
from PreMerge as a
inner join Premerge as b
on left(a.Surname,5)=left(b.Surname,5)
and left(a.forename,1) = left(B.forename,1)
and a.postcode=b.postcode
where ((a.hierarchy<b.hierarchy) or (a.hierarchy=b.hierarchy and a.MergeID>b.MergeID))
and left(A.companynameny,8) = left(B.companyname,8)
and left(A.address1,8) = left(B.address1,8)
and a.postcode is not null and b.postcode is not null
and a.address1 is not null and b.address1 is not null
and a.forename is not null and b.forename is not null
and a.companyname is not null and b.companyname is not null


This is what is called Pass 1. Now we have various other passes that get applied with each pass having a different merge criteria for example we could look for:

Pass 2:

• left(a.Surname,5)=left(b.Surname,5)
• left(a.forename,1) = left(B.forename,1)
• a.postcode=b.postcode
• left(A.address1,12) = left(B.address1,12)

Pass 3:

• left(a.Surname,5)=left(b.Surname,5)
• left(a.forename,1) = left(b.forename,1)
• a.postcode=b.postcode
• left(A.companyname,14) = left(B.companyname,14)

Pass 4:

• left(a.Surname,5)=left(b.Surname,5)
• a.title = b.title
• a.postcode=b.postcode
• left(A.companyname,12) =left(B.companyname,12)

There are more than 20 passes that I have identified within the hundreds of lines of code!
But almost all of the passes use a deterministic approach rather than a probabilistic one.

So you're wondering what I'm after?! Well I want some opinions about our current technique. I could list all the logic for the passes but I just want to know if I should scrap our technique and apply a probabilistic methodology instead?

I've read several articles from the web but I'm none the wiser! I could do with some expert advice please!

Thanks for taking the time to read this!

yosiasz
Master Smack Fu Yak Hacker

1635 Posts

Posted - 2012-07-30 : 15:14:34
Not sure if this would help you , instead of passes could you use a batch process that does proximity sort of algorithm . the below links are for something different but maybe the principle is the same?

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=104431&SearchTerms=proximity

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=173550&SearchTerms=proximity

you could compare each data point and see which one is 'closer' to the other and score them.

<><><><><><><><><><><><><><><><><>
If you don't have the passion to help people, you have no passion
Go to Top of Page

Abu-Dina
Posting Yak Master

206 Posts

Posted - 2012-07-31 : 03:56:22
Thanks yosiasz.
Go to Top of Page
   

- Advertisement -