Advanced Text Comparison Project

ahouse
Starting Member

27 Posts

Posted - 2009-01-16 : 17:33:42

Hi all,
This is more of a fundamental question than a syntactical one.

I have this table of 14 million correctly spelled business names.
I have another table of 2000 business names that may not be spelled exactly the same but are similar.

I'm trying to find a way to do some type of advanced compare of the text fields to generate hits similar to what a web search engine would return, i.e. the best possible match.

My initial thoughts were to do something like this:

Start by doing a straight compare of the fields to find any perfect matches and mark them.

Take the unmarked and try to compare the first 15 characters against the first 15 characters of the large file and mark them.

And so on and so forth all the way down to the first 3 characters or something.
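The passes described above could be sketched roughly as follows. This is only an illustration: the table and column names (NewNames, MasterNames, MatchedName, MatchLength) are hypothetical and do not come from the post.

```sql
-- Hypothetical tables:
--   NewNames(BusinessName, MatchedName, MatchLength)  the 2000 names to resolve
--   MasterNames(BusinessName)                         the 14 million reference names

-- Pass 1: exact matches.
UPDATE n
SET    MatchedName = m.BusinessName,
       MatchLength = LEN(n.BusinessName)
FROM   NewNames AS n
JOIN   MasterNames AS m
  ON   m.BusinessName = n.BusinessName
WHERE  n.MatchedName IS NULL

-- Later passes: compare ever shorter prefixes, from 15 characters down to 3.
-- Rows already marked by an earlier pass are skipped.
DECLARE @len int
SET @len = 15
WHILE @len >= 3
BEGIN
    -- If several master rows share the prefix, one of them is picked arbitrarily.
    UPDATE n
    SET    MatchedName = m.BusinessName,
           MatchLength = @len
    FROM   NewNames AS n
    JOIN   MasterNames AS m
      ON   LEFT(m.BusinessName, @len) = LEFT(n.BusinessName, @len)
    WHERE  n.MatchedName IS NULL

    SET @len = @len - 1
END
```

Two caveats with this approach: a misspelling near the start of a name defeats every prefix pass, and wrapping the master column in LEFT() will likely prevent an index seek, so each pass may scan the large table.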

I'm not sure if this is the best way to go about this. Does anyone know of any script/algorithms that would yield better results for me?

Any thoughts would be greatly appreciated.

Thanks everyone!

sodeep
Master Smack Fu Yak Hacker

7174 Posts

Posted - 2009-01-16 : 18:16:11

Look at this:
http://sql-server-performance.com/Community/forums/p/20600/114956.aspx

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2009-01-16 : 23:54:33

Seems like what you need is the SOUNDEX() or DIFFERENCE() function.

http://doc.ddart.net/mssql/sql70/setu-sus_8.htm

http://doc.ddart.net/mssql/sql70/de-dz_7.htm
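As a quick illustration of those two built-in functions, the following can be run as-is. The sample names are made up for the example.

```sql
-- SOUNDEX() returns a four-character phonetic code;
-- DIFFERENCE() compares the codes of two strings and returns 0 (no match) to 4 (best match).
SELECT SOUNDEX('Smith')             AS Code1,       -- S530
       SOUNDEX('Smyth')             AS Code2,       -- S530
       DIFFERENCE('Smith', 'Smyth') AS Similarity   -- 4
```

Because the code is only four characters long, it compares names coarsely, so it is probably better suited to narrowing down candidates than to ranking long business names.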

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2009-01-17 : 00:13:05

also this

http://sqlblindman.googlepages.com/fuzzysearchalgorithm

ahouse
Starting Member

27 Posts

Posted - 2009-01-19 : 15:37:28

Great articles from everyone. Thank you.

visakh16 - I'm having trouble wrapping my head around this article: http://sqlblindman.googlepages.com/fuzzysearchalgorithm

I think the CompareText() function is great but I'm a little unsure how to use it properly.

To begin, I was going to strip vowels and symbols and store the results in additional fields in both my dataset (NewFile) and the 14 million name db table (BizDB). Once I have those, I can then run my CompareText() function against them.
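A minimal sketch of the stripping step, using nested REPLACE() calls on a sample value. The sample name and the list of symbols are assumptions for illustration, and whether pre-stripping actually improves the CompareText() results depends on what that function already does internally.

```sql
DECLARE @name varchar(100)
SET @name = 'Acme Plumbing & Heating, Inc.'

-- Upper-case first so the result does not depend on a case-sensitive collation.
SET @name = UPPER(@name)

-- Remove vowels.
SET @name = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@name, 'A', ''), 'E', ''), 'I', ''), 'O', ''), 'U', '')

-- Remove a few common symbols (assumed list; extend as needed).
SET @name = REPLACE(REPLACE(REPLACE(REPLACE(@name, '&', ''), ',', ''), '.', ''), '''', '')

SELECT @name AS StrippedName
```

The same expression could be used in an UPDATE to populate the extra column in each table once, rather than recomputing it for every comparison.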

Logically, will this work and is there a more efficient solution that I am overlooking? This is what I have envisioned in my head.


Insert Into ResultsTable(BusinessName,Zip,Phone,Certainty)
Select BizDB.BusinessName,BizDB.Zip,BizDB.Phone, '80'
From BizDB
Inner Join NewFile
ON BizDB.Zip = NewFile.Zip
AND CompareText(BizDB.BusinessName, NewFile.BusinessName) >= 80

Any ideas?

As always, your help and knowledge are greatly appreciated.

Thanks!

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2009-01-20 : 03:02:31

Yup, it looks fine. Why, did you get an error? Also, you should be using dbo.CompareText(..) rather than simply CompareText(..).
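With the owner prefix applied, the query might look like the sketch below. SQL Server requires scalar user-defined functions to be called with at least a two-part name, which is why dbo. is needed. The sketch assumes CompareText() returns a numeric score and that Certainty is a numeric column; it stores the actual score rather than the constant '80'.

```sql
INSERT INTO ResultsTable (BusinessName, Zip, Phone, Certainty)
SELECT b.BusinessName,
       b.Zip,
       b.Phone,
       dbo.CompareText(b.BusinessName, n.BusinessName)   -- actual score, not a constant
FROM   BizDB AS b
INNER JOIN NewFile AS n
        ON b.Zip = n.Zip
WHERE  dbo.CompareText(b.BusinessName, n.BusinessName) >= 80
```

Joining on Zip first keeps the number of name pairs passed to the function as small as possible.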

blindman
Master Smack Fu Yak Hacker

2365 Posts

Posted - 2009-02-02 : 11:39:25

Hi ahouse.
I see you are using my fuzzy search function. Sorry I missed this thread earlier. Let me know if you have any questions about it.

blindman

________________________________________________
If it is not practically useful, then it is practically useless.
________________________________________________

ahouse
Starting Member

27 Posts

Posted - 2009-02-03 : 16:53:35

Blindman,
Thanks for the post!

The CompareText function seems to be working exactly how it is intended to without using the MatchText function. I'm a little hesitant to use the MatchText function. It sounds like a good idea in theory but the results I was getting seemed far more accurate just using the straight CompareText function.

I agree that 80 or above is a good indication of a match.

I do have one question that might help me out:

Now that I have been able to get it working the way I want to, my next step is, of course, to optimize the stored procedure so it will run quickly.
As I stated in my previous post, I am joining based on Zip AND CompareText >= 80.

Right now I have both tables indexed (clustered) on the Zip field. Is there any added benefit to indexing or Full Text Searching on the actual Business Name field?

The logic for my stored procedure is the following:
1. Create some variables
2. Perform a Select Into #BusinessName_DataSet joining on Zip Code
3. Perform an Insert Into BusinessName_Results using CompareText against the #BusinessName_DataSet table
4. Drop the #BusinessName_DataSet table
5. Move on to the next BusinessName to check
   (Set @NewAccount = Select Min(UniqueID))
6. Loop back to the beginning until all UniqueIDs have been processed

I am relatively new to SQL and I have read plenty of articles but actually having to implement something is a whole new ball game. Any tips would be most helpful.

Thanks again for the post!
AHouse

blindman
Master Smack Fu Yak Hacker

2365 Posts

Posted - 2009-02-04 : 14:08:53

There would be no benefit to the CompareText function from indexing the BusinessName field. With all the parsing it does, indexes would be tossed out the window.
I would think you could avoid using a loop or a cursor for this though. I'd be surprised if a set-based operation wouldn't ultimately be faster.

ahouse
Starting Member

27 Posts

Posted - 2009-02-04 : 16:11:17

Ok, I didn't think a Business Name index would do me any good. Thanks for confirming.

As for avoiding the looping, I think it would be much faster if I could do without but I am having trouble logically doing the join.

I took the CompareText a step further and am trying to match on a set of zip codes instead of just one. Let me attempt to explain:

I have my NewBusiness table, which contains the raw business names, a zip code, and the Latitude and Longitude coordinates for that zip code (found by a pre-process). It also contains a field called Filler whose value always equals 1.

I have my ZipCodeMaster table, which contains all zip codes and their corresponding latitude and longitude coordinates. It also contains a field called Filler whose value always equals 1.

I have my BusinessMaster table, which contains the Business Name and corresponding Zip Code.

So my first step is to run my CalculateDistance function to find all Zip Codes within a range of XX miles based on the Latitude and Longitude coordinates. This piece I can do (using the Filler as my Join On field):

Select ZipCodeMaster.ZipCode
From ZipCodeMaster Join NewBusiness
On ZipCodeMaster.Filler = NewBusiness.Filler
Where dbo.CalculateDistance(NewBusiness.Latitude, NewBusiness.Longitude, ZipCodeMaster.Latitude, ZipCodeMaster.Longitude) < '35'

Then based on this dataset of possible zip codes, I need to join the BusinessMaster table, do my comparetext() function and insert the results to a table (somehow):

Select BusinessMaster.BusinessName
From BusinessMaster Join "possible zip code dataset"
On BusinessMaster.ZipCode = "possible zip code dataset".ZipCode
Where dbo.CompareText(BusinessMaster.BusinessName, NewBusiness.BusinessName) >= '80'

I can't comprehend the possibility of combining these two Select statements to make one statement that would allow me to avoid looping.

Firstly, does any of that rambling make sense? Secondly, can you enlighten me with some amazing statement that would solve the problem?
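One way the two statements might be combined is to chain the joins: NewBusiness to ZipCodeMaster by distance, then ZipCodeMaster to BusinessMaster by zip code. This is a sketch under assumptions: the column list of BusinessName_Results is hypothetical, and both functions are assumed to return numeric values. It uses only syntax available in SQL Server 2000.

```sql
INSERT INTO BusinessName_Results (NewBusinessName, MatchedBusinessName, ZipCode, Certainty)
SELECT nb.BusinessName,
       bm.BusinessName,
       bm.ZipCode,
       dbo.CompareText(bm.BusinessName, nb.BusinessName)
FROM   NewBusiness AS nb
-- Every new business is paired with every zip code within 35 miles.
-- This replaces the Filler = Filler join, which pairs all rows before filtering.
JOIN   ZipCodeMaster AS z
  ON   dbo.CalculateDistance(nb.Latitude, nb.Longitude, z.Latitude, z.Longitude) < 35
-- Candidate businesses are those located in one of the nearby zip codes.
JOIN   BusinessMaster AS bm
  ON   bm.ZipCode = z.ZipCode
WHERE  dbo.CompareText(bm.BusinessName, nb.BusinessName) >= 80
```

This removes the loop, but both functions still run once per candidate pair, so the cost is dominated by how many pairs survive the distance and zip code joins. If the distance step proves slow, materializing the nearby zip codes into a temp table first and joining to that may be worth testing.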
