Author |
Topic |
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2005-06-24 : 09:52:40
|
See here www.merriampark.com/ld.htm for information about the algorithm. This page has a link ([url]http://www.merriampark.com/ldtsql.htm[/url]) to a T-SQL implementation by Joseph Gama: unfortunately, that function doesn't work. There is a debugged version in the also-referenced package of TSQL functions ([url]http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=502&lngWId=5[/url]), but this still has the fundamental problem that it only works on pairs of strings up to 49 characters.CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))RETURNS intASBEGIN DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int, @cv0 varbinary(8000), @cv1 varbinary(8000) SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0 WHILE @j <= @s2_len SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1 WHILE @i <= @s1_len BEGIN SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1 WHILE @j <= @s2_len BEGIN SET @c = @c + 1 SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) + CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END IF @c > @c_temp SET @c = @c_temp SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1 IF @c > @c_temp SET @c = @c_temp SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1 END SELECT @cv1 = @cv0, @i = @i + 1 END RETURN @cEND |
|
Seventhnight
Master Smack Fu Yak Hacker
2878 Posts |
Posted - 2005-06-24 : 11:13:52
|
speed :Select admin.dbo.edit_distance(replicate('abcdefghij',399),replicate('abcdefghij',399)) took 1:19 to run and return 0Select admin.dbo.edit_distance(replicate('abcdefghij',399),replicate('zabcdefghi',399))took 1:14 to run and return 400 (correct)CoreyCo-worker on The Wizard of Oz "...those three midgets that came out and danced, the freaked me out when I was little. But they are ok now." |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2005-06-24 : 11:51:42
|
I didn't say it was fast for big strings, just that it worked! Actually, the reason I posted it this week was that the old version used a temporary table (i,j,cost) and that was ludicrously slow even on short strings, so I rewrote it.Any ideas on how to make it faster, though, would be handy. Though I don't think any simple changes are going to make it asymptotically better than O(m*n) |
|
|
Seventhnight
Master Smack Fu Yak Hacker
2878 Posts |
Posted - 2005-06-24 : 11:53:03
|
i'm thinking CoreyCo-worker on The Wizard of Oz "...those three midgets that came out and danced, the freaked me out when I was little. But they are ok now." |
|
|
Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)
7020 Posts |
Posted - 2005-06-24 : 12:09:41
|
Anyone have an example of an application you would use this for?CODO ERGO SUM |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2005-06-24 : 12:17:24
|
In my case, I'd been supplied with a table of 14500 GPs (primary care doctors). About 1200 of those, the name in that table didn't match the name in the national table (about 45000 rows) (joining on national GP id). Sorting the mismatches by edit distance I could quickly see what sort of differences there were -- most of them were one or two character differences. |
|
|
SamC
White Water Yakist
3467 Posts |
Posted - 2005-06-24 : 14:07:22
|
quote: Originally posted by Michael Valentine Jones Anyone have an example of an application you would use this for?
You may have already seen the reader challenge I posted using this function.I'm not finding many takers... |
|
|
TG
Master Smack Fu Yak Hacker
6065 Posts |
Posted - 2005-06-24 : 14:57:42
|
>>I'm not finding many takers...My big sister used to give me similar "challenges". ie: I bet you can't wash all those dishes before I'm done watching this TV show.Be One with the OptimizerTG |
|
|
spirit1
Cybernetic Yak Master
11752 Posts |
Posted - 2005-06-24 : 18:09:01
|
well we use it for our CRM...when adding a new organization or person you can search using distance check or not.although it's implemented in asp not sql.Go with the flow & have fun! Else fight the flow |
|
|
Kristen
Test
22859 Posts |
Posted - 2005-06-24 : 23:53:39
|
"I bet you can't wash all those dishes before I'm done watching this TV show"Shucks, I wish I'd had a big sister ... Kristen |
|
|
elwoos
Master Smack Fu Yak Hacker
2052 Posts |
Posted - 2005-06-27 : 04:16:44
|
Arnold, you could be a life saver. I am going to have to do something similar to you sooncheerssteve (Possibly only the second person on this board that can pronounce Machynlleth) |
|
|
SamC
White Water Yakist
3467 Posts |
Posted - 2005-06-27 : 09:54:36
|
quote: Originally posted by TG My big sister used to give me similar "challenges". ie: I bet you can't wash all those dishes before I'm done watching this TV show.
Hey! I resemble that remark!... I mentioned in my post that it seemed better to correct the mistakes rather than pursue approximations (in our case anyway). |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2005-06-27 : 10:26:09
|
quote: Originally posted by elwoos Arnold, you could be a life saver. I am going to have to do something similar to you soon
Have fun! Hope it's easier than the time I tried to write something to separate the initials and surname in the "Organisation Name" column of EGPCUR! Love to know what idiot at NACS thought it would be a good idea to give the tables for GPs, practices, PCTs and health authorities the same schema. And don't get me started on GPs who work in multiple practices... |
|
|
SamC
White Water Yakist
3467 Posts |
Posted - 2005-06-27 : 11:03:11
|
Couldn't help notice you use SELECT instead of SET. BOL makes a issue about there being some benefit using SET rather than SELECT for non-table based (scalar?) calculations.Anyone know why? I'm guessing it's a compile time optimization with no runtime benefit...Sam |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2005-06-27 : 12:36:28
|
quote: Originally posted by SamC Couldn't help notice you use SELECT instead of SET. BOL makes a issue about there being some benefit using SET rather than SELECT for non-table based (scalar?) calculations.
Does it? Oh, I assumed it could spot that there was no actual selecting involved -- after all, if there were, it wouldn't allow the function to be defined! -- and that it would get interpreted the same way as SET. It's quite possible it does make a difference, though. I'll give it a go. I only used SELECT because I had a load of variables that needed setting and it was getting visually noisy having all those SETs. |
|
|
elwoos
Master Smack Fu Yak Hacker
2052 Posts |
Posted - 2005-06-28 : 03:39:08
|
Arnold, you mean we can't just ignore them - maybe that's where I've been going wrong . Anyway the GP's are the 'good' guys. What about the consultants! Damn GMC needs to get its act togetherThe functions I wrote [url]http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=50044[/url] (albeit with errors!) were partially for handling this little lotAlright Brain, you don't like me, and I don't like you. But lets just do this, and I can get back to killing you with beer. |
|
|
rrb
SQLTeam Poet Laureate
1479 Posts |
Posted - 2005-06-28 : 03:57:23
|
Mr Fribble, if I haven't already said it a million times... You ROCK!That's extra cool - more cool than just normally cool!Thanks--I hope that when I die someone will say of me "That guy sure owed me a lot of money" |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2005-06-28 : 03:57:42
|
quote: Anyway the GP's are the 'good' guys.
Oh, sorry, it wasn't the GPs themselves I was complaining about, just NACS mishandling of the database where a GP works in multiple practices (or at least, multiple entities represented in the EPRACCUR table). One of my colleagues asked NACS about this and they said that they create a new national GP code (starting with G6 or something) for the GP working in the 'other' practice.Certainly if you look at something like this:SELECT A.*FROM EGPCUR AS AINNER JOIN ( SELECT "Organisation Name", LEFT(Postcode, 2) AS pc FROM EGPCUR GROUP BY "Organisation Name", LEFT(Postcode, 2) HAVING COUNT(*) > 1 ) AS B ON A."Organisation Name" = B."Organisation Name" AND LEFT(A.Postcode, 2) = pcORDER BY A."Organisation Name", A.Postcode you see (apart from many rows in EGPCUR that don't represent a single person!) quite a number of occurrences of this sort of thing. |
|
|
SamC
White Water Yakist
3467 Posts |
Posted - 2005-06-28 : 07:17:30
|
Is it a consensus then that if 3 fields exists: Firstname, Middlename, Lastname, that the a distance on (Firstname + ' ' + Middlename + ' ' + Lastname) will be as accurate (and faster) than 3 separate distance calculations on the individual columns? |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2005-06-28 : 07:56:26
|
quote: Originally posted by SamC Is it a consensus then that if 3 fields exists: Firstname, Middlename, Lastname, that the a distance on (Firstname + ' ' + Middlename + ' ' + Lastname) will be as accurate (and faster) than 3 separate distance calculations on the individual columns?
If it is faster to combine the columns, then it's only because of the call overhead. The function takes O(nm) time (where n and m are the lengths of the strings)*, so you're comparing m1n1 + m2n2 + m3n3 to (m1+m2+m3)(n1+n2+n3)But "as accurate" depends on the data, and what you're trying to compare.* Probably worse in this implementation, since it's reassigning the costs in the cost vectors. |
|
|
SamC
White Water Yakist
3467 Posts |
Posted - 2005-06-28 : 08:24:39
|
I could experiment, but let me ask anyway. Suppose your data largely had errors in the middlename being present (or not), and the Firstname, Lastname were (usually) a match.Would accuracy suffer in the comparison of (m1+m2+m3) versus comparing m1, m2 and m3 individually?The query will be much simpler in the former, but it seems to me it would be no less accurate than the latter. |
|
|
Next Page
|