Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
| Author |
Topic |
|
ashok
Yak Posting Veteran
57 Posts |
Posted - 2003-06-05 : 07:44:21
|
| I have a large table (2 million+ records...) with two columns:phrase and control_no ; a control_no can have multiple phrases,some sample data is shown below:phrase Control_No---------- -----------------Africa Growth and Opportunity Act A03324Africa Growth and Opportunity Bill A02978Constitutional Amendments A02921Constitutional amendments A02823Constitution Amendments A02777In the sample data above:"Africa Growth and Opportunity Bill" is actually a mistake in thedata, when it should have been "Africa Growth and Opportunity Act".My aim is to automate detection (or at least report on a possible error) of such mistakes.Currently I have a python script that performs a variety of checkson the data before sending it to a staging database, where I do further relational checks, before sending the checkeddata to a final live db. All this is working very well, exceptfor the kind of checking I just mentioned.My question now being, is there a way to produce a report onpossible erroneous data based on partial matches ?for. eg. if I had a query with a like condition as'Africa Growth and %'it would come up with the correct and wrong phrases.However, my problem is that it is not manually possible to locate such partial phrases within the data.Is there a way to automate the creation of a type of "best-fit" partial phrase list on the data using SQL ? I could probably then use this list to produce a list of the actual erroneous phrases.....Along with the erroneous data, there is also a lot of correct non-repeated phrase data.thanks for reading a long post ! hope someone can point me in the right direction !--ashok |
|
|
X002548
Not Just a Number
15586 Posts |
Posted - 2003-06-05 : 09:15:44
|
| First, the problem could have been avoided. You need a parental Phrase table that contains a list of all valid phrases. Then implement RI, and you would alway get an error when an "illegal" phrase was about to be inserted or updated...But now that the Cats out of the bag, my question to you is, how dop you know, out of 2 million rows, which phrases are valid?And who makes that "business" decision?Brett8-) |
 |
|
|
ashok
Yak Posting Veteran
57 Posts |
Posted - 2003-06-05 : 09:38:12
|
The basic info. is being picked up from a library/citation database system,which uses a flat-file kinda format. Something like a very-wide (62 column table...). Using the python script i have transformed this into 4 relational tables...What you mention about the parental phrase table is precisely what I am trying to create in an automated manner; by creating a list of "highly probable invalid-phrases".... and then using this list against the main phrase table to arrive at a set of correct phrases.quote: You need a parental Phrase table that contains a list of all valid phrases.
Ultimately this will be decided manually (my client does not have a manpower problem!) based on the "highly probable list" i just mentioned. Basically they will use this report to correct the data in their main citation database...Right now I am grouping terms together using the soundex() function,for a sort of crude high level grouping. Just thought there might be a better way!quote: how do you know, out of 2 million rows, which phrases are valid?
--ashok |
 |
|
|
X002548
Not Just a Number
15586 Posts |
Posted - 2003-06-05 : 10:48:30
|
quote: Right now I am grouping terms together using the soundex() function,for a sort of crude high level grouping. Just thought there might be a better way!
What does SOUNDEX give you? The results as listed in your first post, or something different?Also how many suspected distinct values (best guess) do you think you have? Is it managable? I wouldn't think so with 2 million rows.Brett8-) |
 |
|
|
|
|
|
|
|