
 Partial pattern matching


ashok
Yak Posting Veteran

57 Posts

Posted - 2003-06-05 : 07:44:21
I have a large table (2 million+ records) with two columns,
phrase and control_no; a control_no can have multiple phrases.
Some sample data is shown below:

phrase                               Control_No
-----------------------------------  ----------
Africa Growth and Opportunity Act    A03324
Africa Growth and Opportunity Bill   A02978
Constitutional Amendments            A02921
Constitutional amendments            A02823
Constitution Amendments              A02777

In the sample data above:
"Africa Growth and Opportunity Bill" is actually a mistake in the
data; it should have been
"Africa Growth and Opportunity Act".

My aim is to automate detection (or at least report on a possible
error) of such mistakes.

Currently I have a Python script that performs a variety of checks
on the data before sending it to a staging database, where I do
further relational checks before sending the checked data to the
final live DB. All this is working very well, except
for the kind of checking I just mentioned.

My question is: is there a way to produce a report on
possible erroneous data based on partial matches?

For example, if I had a query with a LIKE condition such as
'Africa Growth and %'
it would come up with both the correct and the wrong phrases.

However, my problem is that it is not feasible to locate such partial phrases manually within the data.
Is there a way to automate the creation of a type of "best-fit" partial phrase list from the data using SQL? I could probably then use this list to produce a list of the actual erroneous phrases...
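Since the existing checks already run in a Python script, a prefix-grouping pass could be bolted on there. The following is a minimal sketch, not the poster's actual code; the sample rows are taken from the post, and the three-word prefix length is an arbitrary knob that would need tuning:

```python
from collections import defaultdict

# (phrase, control_no) rows as in the sample data from the post.
rows = [
    ("Africa Growth and Opportunity Act",  "A03324"),
    ("Africa Growth and Opportunity Bill", "A02978"),
    ("Constitutional Amendments",          "A02921"),
    ("Constitutional amendments",          "A02823"),
    ("Constitution Amendments",            "A02777"),
]

def prefix_report(rows, n_words=3):
    """Group phrases that share their first n_words words (case-folded)
    and return only the groups holding more than one distinct phrase."""
    groups = defaultdict(set)
    for phrase, _control_no in rows:
        key = " ".join(phrase.lower().split()[:n_words])
        groups[key].add(phrase)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

report = prefix_report(rows)
```

Any group containing more than one distinct phrase is a candidate for the error report; the same grouping could be done in T-SQL with a GROUP BY on a computed prefix expression, though the word-splitting is easier in Python.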

Along with the erroneous data, there is also a lot of correct non-repeated phrase data.

Thanks for reading a long post! Hope someone can point me in the right direction!

--ashok

X002548
Not Just a Number

15586 Posts

Posted - 2003-06-05 : 09:15:44
First, the problem could have been avoided. You need a parental Phrase table that contains a list of all valid phrases. Then implement RI (referential integrity), and you would always get an error when an "illegal" phrase was about to be inserted or updated...
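Brett's point about RI can be sketched quickly. The snippet below uses Python's built-in sqlite3 as a stand-in (SQL Server 2000 would use a FOREIGN KEY constraint the same way); all table and column names here are illustrative, not from the actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked

# Parent table: the authoritative list of valid phrases.
conn.execute("CREATE TABLE ValidPhrase (phrase TEXT PRIMARY KEY)")

# Child table: every row must reference a phrase in the parent.
conn.execute("""
    CREATE TABLE Citation (
        control_no TEXT NOT NULL,
        phrase     TEXT NOT NULL REFERENCES ValidPhrase(phrase)
    )""")

conn.execute("INSERT INTO ValidPhrase VALUES ('Africa Growth and Opportunity Act')")
conn.execute("INSERT INTO Citation VALUES ('A03324', 'Africa Growth and Opportunity Act')")

# The misspelled phrase is rejected before it ever reaches the table.
try:
    conn.execute("INSERT INTO Citation VALUES ('A02978', 'Africa Growth and Opportunity Bill')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

With the constraint in place, "Africa Growth and Opportunity Bill" never gets stored, which is exactly the error-at-insert-time behavior described above.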

But now that the cat's out of the bag, my question to you is: how do you know, out of 2 million rows, which phrases are valid?

And who makes that "business" decision?



Brett

8-)

ashok
Yak Posting Veteran

57 Posts

Posted - 2003-06-05 : 09:38:12
The basic info is being picked up from a library/citation database system, which uses a flat-file kind of format, something like a very wide (62-column) table. Using the Python script I have transformed this
into 4 relational tables...

What you mention about the parental phrase table is precisely what I am trying to create in an automated manner: a list of "highly probable invalid phrases" that I can then use against the main phrase table to arrive at a set of correct phrases.

quote:

You need a parental Phrase table that contains a list of all valid phrases.



Ultimately this will be decided manually (my client does not have a manpower problem!) based on the "highly probable" list I just mentioned. Basically they will use this report to correct the data in their main citation database...
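One stdlib way to draft such a "highly probable" list for the manual reviewers is pairwise similarity scoring. Below is a sketch using Python's difflib (my own suggestion, not what the existing script does); the 0.85 threshold is a guess that would need tuning against real data:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Sample phrases from the post.
phrases = [
    "Africa Growth and Opportunity Act",
    "Africa Growth and Opportunity Bill",
    "Constitutional Amendments",
    "Constitutional amendments",
    "Constitution Amendments",
]

def suspect_pairs(phrases, threshold=0.85):
    """Return (a, b, ratio) for every pair of phrases whose case-folded
    similarity ratio reaches the threshold; these go on the review report."""
    out = []
    for a, b in combinations(phrases, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            out.append((a, b, round(ratio, 2)))
    return out

pairs = suspect_pairs(phrases)
```

Pairwise comparison is O(n²), so over 2 million rows it would only be practical inside the smaller clusters produced by a coarse grouping step such as the SOUNDEX grouping mentioned below.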

Right now I am grouping terms together using the soundex() function,
for a sort of crude high level grouping. Just thought there might be a better way!
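For reference, T-SQL's SOUNDEX() can be reproduced in the Python staging script so both layers group identically. Here is a sketch of standard American Soundex (the h/w and vowel handling follow the commonly published algorithm), plus the kind of crude phrase grouping described above:

```python
from collections import defaultdict

def soundex(word):
    """Standard 4-character American Soundex code (e.g. 'Robert' -> 'R163')."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")      # first letter's code suppresses a duplicate
    for ch in word[1:]:
        if ch in "hw":                 # h and w do not separate duplicate codes
            continue
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit                   # a vowel resets prev, allowing repeats
    return (result + "000")[:4]

# Crude high-level grouping: key each phrase by the Soundex of its words.
groups = defaultdict(list)
for phrase in ["Constitutional Amendments", "Constitution Amendments"]:
    key = " ".join(soundex(w) for w in phrase.split())
    groups[key].append(phrase)
```

Both sample phrases land in the same group, matching the crude clustering described above; the weakness is that Soundex keys only the leading consonant sounds, so longer words with different endings can collide in one group.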

quote:

how do you know, out of 2 million rows, which phrases are valid?



--ashok

X002548
Not Just a Number

15586 Posts

Posted - 2003-06-05 : 10:48:30
quote:

Right now I am grouping terms together using the soundex() function,
for a sort of crude high level grouping. Just thought there might be a better way!



What does SOUNDEX give you? The results as listed in your first post, or something different?

Also, how many suspected distinct values (best guess) do you think you have? Is it manageable? I wouldn't think so with 2 million rows.



Brett

8-)
   
