ahouse
Starting Member
27 Posts |
Posted - 2009-01-16 : 17:33:42
|
Hi all,

This is more of a fundamental question than a syntactical one. I have a table of 14 million correctly spelled business names. I have another table of 2,000 business names that may not be spelled exactly the same but are similar. I'm trying to find a way to do some type of advanced comparison of the text fields to generate hits similar to what a web search engine would do, i.e. the best possible match.

My initial thoughts were to do something like this:

Start by doing a straight compare of the fields to find any perfect matches and mark them.
Take the unmarked rows and compare their first 15 characters against the first 15 characters of the names in the large table, and mark them.
And so on, all the way down to the first 3 characters or so.

I'm not sure if this is the best way to go about this. Does anyone know of any scripts/algorithms that would yield better results for me?

Any thoughts would be greatly appreciated.

Thanks everyone! |
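The tiered approach described above (exact match first, then progressively shorter prefixes) could be sketched roughly as follows. This is only an illustration under assumptions: the table names NewFile and BizDB are borrowed from later in the thread, and the MatchedName and MatchLength columns are hypothetical additions used to mark rows.

```sql
-- Assumed, illustrative schema (not from the original post):
--   NewFile(BusinessName, MatchedName, MatchLength)   -- the 2,000 names
--   BizDB(BusinessName)                                -- the 14 million names

-- Pass 1: exact matches.
UPDATE n
SET    n.MatchedName = b.BusinessName,
       n.MatchLength = LEN(n.BusinessName)
FROM   NewFile AS n
JOIN   BizDB   AS b ON b.BusinessName = n.BusinessName
WHERE  n.MatchedName IS NULL;

-- Later passes: compare shorter and shorter prefixes, 15 characters down to 3.
DECLARE @len int;
SET @len = 15;

WHILE @len >= 3
BEGIN
    -- If several BizDB rows share the prefix, one of them is picked arbitrarily.
    UPDATE n
    SET    n.MatchedName = b.BusinessName,
           n.MatchLength = @len
    FROM   NewFile AS n
    JOIN   BizDB   AS b
           ON LEFT(b.BusinessName, @len) = LEFT(n.BusinessName, @len)
    WHERE  n.MatchedName IS NULL;   -- only rows still unmarked

    SET @len = @len - 1;
END;
```

Note that the LEFT() comparison cannot use an ordinary index on BusinessName, so each pass is likely to scan the large table, and prefix matching only catches misspellings that occur late in the name.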
|
sodeep
Master Smack Fu Yak Hacker
7174 Posts |
Posted - 2009-01-16 : 18:16:11
|
Look at this: http://sql-server-performance.com/Community/forums/p/20600/114956.aspx |
|
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
|
ahouse
Starting Member
27 Posts |
Posted - 2009-01-19 : 15:37:28
|
Great articles from everyone. Thank you.

visakh16 - I'm having trouble wrapping my head around this article: http://sqlblindman.googlepages.com/fuzzysearchalgorithm

I think the CompareText() function is great, but I'm a little unsure how to use it properly. To begin, I was going to strip vowels and symbols and create additional fields in both my dataset (NewFile) and the 14 million name db table (BizDB). Once I have those, I can then perform my CompareText() function.

Logically, will this work, and is there a more efficient solution that I am overlooking? This is what I have envisioned in my head:

Insert Into ResultsTable (BusinessName, Zip, Phone, Certainty)
Select BizDB.BusinessName, BizDB.Zip, BizDB.Phone, '80'
From BizDB
Inner Join NewFile
ON BizDB.Zip = NewFile.Zip
AND CompareText(BizDB.BusinessName, NewFile.BusinessName) >= 80

Any ideas?

As always, your help and knowledge are greatly appreciated.

Thanks! |
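For the vowel- and symbol-stripping step mentioned above, a minimal sketch using only the built-in REPLACE and UPPER functions might look like this. The NameStripped alias and the choice of which symbols to remove (period and comma here) are assumptions for illustration; whether CompareText needs this pre-processing at all is not established in the thread.

```sql
-- Illustrative only: derive a vowel-free, punctuation-free version of each name.
-- NameStripped is a hypothetical column alias, not something from the article.
SELECT BusinessName,
       REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
           UPPER(BusinessName),
           'A', ''), 'E', ''), 'I', ''), 'O', ''), 'U', ''),
           '.', ''), ',', '') AS NameStripped
FROM   NewFile;
```

The same expression could be used in an UPDATE to populate an extra column in both NewFile and BizDB, so the stripping is done once rather than on every comparison.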
|
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2009-01-20 : 03:02:31
|
Yup, it looks fine. Why, did you get an error? Also, you should be using dbo.CompareText(..) rather than simply CompareText(..) |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2009-02-02 : 11:39:25
|
Hi ahouse.

I see you are using my fuzzy search function. Sorry I missed this thread earlier. Let me know if you have any questions about it.

blindman
________________________________________________
If it is not practically useful, then it is practically useless.
________________________________________________ |
|
|
ahouse
Starting Member
27 Posts |
Posted - 2009-02-03 : 16:53:35
|
Blindman,

Thanks for the post!

The CompareText function seems to be working exactly as intended without using the MatchText function. I'm a little hesitant to use the MatchText function. It sounds like a good idea in theory, but the results I was getting seemed far more accurate using just the straight CompareText function. I agree that 80 or above is a good indication of a match.

I do have one question that might help me out. Now that I have been able to get it working the way I want, my next step is, of course, to optimize the stored procedure so it will run quickly. As I stated in my previous post, I am joining based on Zip AND CompareText >= 80. Right now I have both tables indexed (clustered) on the Zip field. Is there any added benefit to indexing or full-text searching on the actual Business Name field?

The logic for my stored procedure is the following:

1. Create some variables
2. Perform a Select Into #BusinessName_DataSet joining on Zip Code
3. Perform an Insert Into BusinessName_Results using CompareText against the #BusinessName_DataSet table
4. Drop the #BusinessName_DataSet table
5. Move on to the next BusinessName to check (Set @NewAccount = Select(Min(UniqueID)
6. Loop back to the beginning until all UniqueIDs have been processed

I am relatively new to SQL and I have read plenty of articles, but actually having to implement something is a whole new ball game. Any tips would be most helpful.

Thanks again for the post!

AHouse |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2009-02-04 : 14:08:53
|
There would be no benefit to the CompareText function from indexing the BusinessName field. With all the parsing it does, indexes would be tossed out the window.

I would think you could avoid using a loop or a cursor for this though. I'd be surprised if a set-based operation wouldn't ultimately be faster.

________________________________________________
If it is not practically useful, then it is practically useless.
________________________________________________ |
|
|
ahouse
Starting Member
27 Posts |
Posted - 2009-02-04 : 16:11:17
|
OK, I didn't think an index on the business name would do me any good. Thanks for confirming.

As for avoiding the looping, I think it would be much faster if I could do without it, but I am having trouble logically doing the join. I took CompareText a step further and am trying to match on a set of zip codes instead of just one. Let me attempt to explain:

I have my NewBusiness table, which contains the raw business names, a zip code, and (from a pre-processing step) the latitude and longitude coordinates for that zip code. It also contains a field called Filler whose value always equals 1.

I have my ZipCodeMaster table, which contains all zip codes and their corresponding latitude and longitude coordinates. It also contains a field called Filler whose value always equals 1.

I have my BusinessMaster table, which contains the Business Name and corresponding Zip Code.

So my first step is to use my CalculateDistance function to find all zip codes within a range of XX miles based on the latitude and longitude coordinates. This piece I can do (using the filler as my Join On field):

Select ZipCodeMaster.ZipCode
From ZipCodeMaster Join NewBusiness
On ZipCodeMaster.Filler = NewBusiness.Filler
Where dbo.CalculateDistance(NewBusiness.Latitude, NewBusiness.Longitude, ZipCodeMaster.Latitude, ZipCodeMaster.Longitude) < '35'

Then, based on this dataset of possible zip codes, I need to join the BusinessMaster table, apply my CompareText() function, and insert the results into a table (somehow):

Select BusinessMaster.BusinessName
From BusinessMaster Join "possible zip code dataset"
On BusinessMaster.ZipCode = "possible zip code dataset".ZipCode
Where dbo.CompareText(BusinessMaster.BusinessName, NewBusiness.BusinessName) >= '80'

I can't see how to combine these two Select statements into one statement that would allow me to avoid looping. Firstly, does any of that rambling make sense? Secondly, can you enlighten me with some amazing statement that would solve the problem? |
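One possible way to combine the two statements above into a single set-based statement is to carry the new business name along with each nearby zip code, for example with a common table expression (available from SQL Server 2005). This is a sketch under assumptions, not a tested solution: the column list of BusinessName_Results is hypothetical, and the signatures of dbo.CalculateDistance and dbo.CompareText are taken as used earlier in the thread.

```sql
-- Step 1: pair every new business with every zip code within 35 miles.
-- A CROSS JOIN is equivalent to joining on Filler = Filler when Filler is always 1.
WITH NearbyZips AS (
    SELECT nb.BusinessName AS NewBusinessName,
           z.ZipCode
    FROM   NewBusiness   AS nb
    CROSS JOIN ZipCodeMaster AS z
    WHERE  dbo.CalculateDistance(nb.Latitude, nb.Longitude,
                                 z.Latitude,  z.Longitude) < 35
)
-- Step 2: fuzzy-compare only against master businesses in those zip codes.
-- The target column names below are hypothetical.
INSERT INTO BusinessName_Results (NewBusinessName, MatchedBusinessName, ZipCode)
SELECT nz.NewBusinessName,
       bm.BusinessName,
       bm.ZipCode
FROM   NearbyZips     AS nz
JOIN   BusinessMaster AS bm ON bm.ZipCode = nz.ZipCode
WHERE  dbo.CompareText(bm.BusinessName, nz.NewBusinessName) >= 80;
```

The optimizer is free to choose the order in which the two function calls are evaluated, so if the distance calculation turns out to be the bottleneck, materializing the nearby zip codes into an indexed temp table first may be worth trying. The numeric literals 35 and 80 are written without quotes here to avoid an implicit string conversion.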
|
|
|