Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 Site Related Forums
 The Yak Corral
 Reader Challenge

Author  Topic 

SamC
White Water Yakist

3467 Posts

Posted - 2005-06-24 : 10:57:39
This is an application of Arnold's string distance algorithm, but it's tough...

I have two tables... TableA, TableB. Both have Firstname, Middlename, Lastname columns and are *unfortunately*, the only key which can be used to match the two tables. Spelling errors.

Write a query that will return a recordset of exact matches, and include records without an exact match but have a minimal distance (2 or 3) for those that rows do not match.

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2005-06-24 : 11:10:30
How about just use the function Arnold provided with a cross join?


select a.FirstName [AFirst]
,a.middleName [AMiddle]
,a.LastName [ALast]
,b.FirstName [BFirst]
,b.middleName [BMiddle]
,b.LastName [BLast]
from tableA a
cross join tableB b
where dbo.edit_distance(
a.FirstName+a.middleName+a.LastName,
b.FirstName+b.middleName+b.LastName) < 4


Be One with the Optimizer
TG
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2005-06-24 : 12:54:48
quote:
Originally posted by TG

How about just use the function Arnold provided with a cross join?

Won't it match a Row from TableA to more than one row in TableB?

I didn't state the problem well. It's one I've been playing with this week. TableA and TableB should have unique matches, but there are misspellings. (My solution is the misspellings should be corrected., but it's interesting to toy with a distance formula to see how close you can get without correcting the misspellings.)

The problem statement could be refined to something like: Match each row in TableA to a single row in TableB with the minimum distance.

e.g. one match, distance of 0, then 1, then 2 until there is a match.

If two rows match at a distance level (e.g. two matches at 1), it's no match.
Go to Top of Page

Seventhnight
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2005-06-24 : 15:10:45
then i would follow TG's approach... more or less


select a.FirstName [AFirst]
,a.middleName [AMiddle]
,a.LastName [ALast]
,b.FirstName [BFirst]
,b.middleName [BMiddle]
,b.LastName [BLast]
,dbo.edit_distance(
a.FirstName+a.middleName+a.LastName,
b.FirstName+b.middleName+b.LastName)
[distance]
Into #resultsPrep
from tableA a
cross join tableB b
where dbo.edit_distance(
a.FirstName+a.middleName+a.LastName,
b.FirstName+b.middleName+b.LastName) < 4


Select Y.*
From
(
Select
AFirst, AMiddle, ALast, distance=min(distance)
From #resultsPrep s1
Group By AFirst, AMiddle, ALast
Having (Select count(*) From #resultsPrep Where AFirst=s1.AFirst and AMiddle=s1.AMiddle and ALast=s1.ALast and distance=min(s1.distance))=1
) Z
Inner Join #resultsPrep Y
On Z.AFirst=Y.AFirst
and Z.AMiddle=Y.AMiddle
and Z.ALast=Y.ALast
and Z.distance=Y.distance


Drop Table #resultsPrep



Corey

Co-worker on The Wizard of Oz "...those three midgets that came out and danced, the freaked me out when I was little. But they are ok now."
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2005-06-24 : 16:51:04
I wonder if the distance would give better results if applied individually to First, Middle, Last name.

Sam
Go to Top of Page

Seventhnight
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2005-06-24 : 17:40:21
I don't think it matters... because of the algorithm

Corey

Co-worker on The Wizard of Oz "...those three midgets that came out and danced, the freaked me out when I was little. But they are ok now."
Go to Top of Page
   

- Advertisement -