Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
| Author |
Topic |
|
strongweb
Starting Member
12 Posts |
Posted - 2002-11-13 : 12:07:20
|
| Hi, got a question 4 the sql gurus.I am currently in the process of performing a "data cleansing" operation. Basically ensuring my customer's address fields are correct. This involves comparing the address fields with the database of poscode provided by the National Postal Service. My question is this :-1) How can i generate a TSQL script to recognize that :- 12, George St equals 12, George StreetThere is hundreds of possiblities, not just for St/Street so a rule based table would not work. Furthermore, i have 2 million adresses to validate against 40,000 postcodes. |
|
|
MichaelP
Jedi Yak
2489 Posts |
Posted - 2002-11-13 : 12:22:30
|
| This is definatly a WAG, but it's possible SOUNDEX() may help you on "fuzzy" comparisons. Look it up in the BOL.Michael<Yoda>Use the Search page you must. Find the answer you will.</Yoda> |
 |
|
|
strongweb
Starting Member
12 Posts |
Posted - 2002-11-13 : 12:51:51
|
| cant use SOUNDEX, as it has little use for languages other than English :).i'm looking at a solution other than SQL7 now..maybe some exotic perl routines that do fuzzy matching ..any ideas? |
 |
|
|
M.E.
Aged Yak Warrior
539 Posts |
Posted - 2002-11-13 : 12:52:40
|
| I don't think soundex() will work for st vs street though. st does quite sound like street.Ufortunately there is no easy way around this one (kinda wish someone would jump up and say there was :D). one option is to make a replacement table that contains all (I mean ALL) replacement strings.table example:value, replacementst,streetst.,streetave, avenueav, avenueYou'll need to find every last abr. used in your table.You would then replace any ' ' + value + ' ' in the table with ' ' + replacement + ' 'The spaces are important so the word 'string' isn't replaced as 'Streetring'. I will only replace occourences of '_st_' (underscore being spaces). Unforatunely dirty data like this is not easy to clean at all. You will need probably need to use patindex to locate the occourences of st.-----------------------SQL isn't just a hobby, It's an addiction |
 |
|
|
AndrewMurphy
Master Smack Fu Yak Hacker
2916 Posts |
Posted - 2002-11-14 : 06:32:30
|
| something similiar was required here.....http://www.sqlteam.com/Forums/topic.asp?TOPIC_ID=17329Permanently living with non-standard data isn't wise....good advice is to try to convert all variations of a word to 1 standard form (be that an abbreviation or long-version)....once you have 'auto-cleaned' the simple stuff....the amount of poor-quality stuff should drastically fall-away....and you may be able to develop rules to handle what's left over.... |
 |
|
|
|
|
|
|
|