Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 Transact-SQL (2000)
 Comparing Record Data with Itself

Author  Topic 

Lopaka
Starting Member

48 Posts

Posted - 2006-05-05 : 19:44:03
I am trying to take a field and check to see if there are duplicate data. In the following script, #temp gets created and data gets inserted. There are two fields identicaly input, "Administrators Financial Sector". One with an 's' and one without. I am trying to compare the first 80% of the word(s), for example: "Administrat", "Financi", "Sect". Then compares all three words to the rest of the records and returns both records. Keep in mind this is going thru ~100K records.

Does anybody have an idea?

CREATE TABLE #TEMP(
tID INT IDENTITY(1, 1) NOT NULL ,
Field1 CHAR(50) NOT NULL)

INSERT INTO #TEMP(Field1) VALUES('Account Administrator Financial Sector')
INSERT INTO #TEMP(Field1) VALUES('Account Planner')
INSERT INTO #TEMP(Field1) VALUES('Account Support Specialist')
INSERT INTO #TEMP(Field1) VALUES('Accountant')
INSERT INTO #TEMP(Field1) VALUES('Accountant 4')
INSERT INTO #TEMP(Field1) VALUES('Accountant Attorney')
INSERT INTO #TEMP(Field1) VALUES('Accountant Cost & Management')
INSERT INTO #TEMP(Field1) VALUES('Accountant General 1 Entry')
INSERT INTO #TEMP(Field1) VALUES('Accountant Municipal')
INSERT INTO #TEMP(Field1) VALUES('Accountant Public')
INSERT INTO #TEMP(Field1) VALUES('Accounting Auditing Clerk')
INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk 4')
INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk Entry Level')
INSERT INTO #TEMP(Field1) VALUES('Accounting Consultant')
INSERT INTO #TEMP(Field1) VALUES('Accounting Director')
INSERT INTO #TEMP(Field1) VALUES('Accounting Machine Servicer')
INSERT INTO #TEMP(Field1) VALUES('Accounting Section Supervisor')
INSERT INTO #TEMP(Field1) VALUES('Accounting Teacher')
INSERT INTO #TEMP(Field1) VALUES('Accounting Team Leader')
INSERT INTO #TEMP(Field1) VALUES('Accounting Technician')
INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Clerk')
INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Manager')
INSERT INTO #TEMP(Field1) VALUES('Administrators Financial Sector')


Robert R. Barnes

khtan
In (Som, Ni, Yak)

17689 Posts

Posted - 2006-05-05 : 21:36:32
Will this do ?


select *
from #TEMP t1 inner join #TEMP t2
on t1.Field1 like left(t2.Field1, 0.8 * len(t2.Field1)) + '%'
and t1.Field1 <> t2.Field1



KH

Go to Top of Page

Lopaka
Starting Member

48 Posts

Posted - 2006-05-06 : 00:02:18
This returns all fields with one word identical. The object would be to return on the 'Administrator Financial Sector' and 'Administrators Financial Sector'. Unfortunately, I just noticed the first has accountant in it. Need to use the following:

CREATE TABLE #TEMP(
tID INT IDENTITY(1, 1) NOT NULL ,
Field1 CHAR(50) NOT NULL)

INSERT INTO #TEMP(Field1) VALUES('Administrator Financial Sector')
INSERT INTO #TEMP(Field1) VALUES('Account Planner')
INSERT INTO #TEMP(Field1) VALUES('Account Support Specialist')
INSERT INTO #TEMP(Field1) VALUES('Accountant')
INSERT INTO #TEMP(Field1) VALUES('Accountant 4')
INSERT INTO #TEMP(Field1) VALUES('Accountant Attorney')
INSERT INTO #TEMP(Field1) VALUES('Accountant Cost & Management')
INSERT INTO #TEMP(Field1) VALUES('Accountant General 1 Entry')
INSERT INTO #TEMP(Field1) VALUES('Accountant Municipal')
INSERT INTO #TEMP(Field1) VALUES('Accountant Public')
INSERT INTO #TEMP(Field1) VALUES('Accounting Auditing Clerk')
INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk 4')
INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk Entry Level')
INSERT INTO #TEMP(Field1) VALUES('Accounting Consultant')
INSERT INTO #TEMP(Field1) VALUES('Accounting Director')
INSERT INTO #TEMP(Field1) VALUES('Accounting Machine Servicer')
INSERT INTO #TEMP(Field1) VALUES('Accounting Section Supervisor')
INSERT INTO #TEMP(Field1) VALUES('Accounting Teacher')
INSERT INTO #TEMP(Field1) VALUES('Accounting Team Leader')
INSERT INTO #TEMP(Field1) VALUES('Accounting Technician')
INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Clerk')
INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Manager')
INSERT INTO #TEMP(Field1) VALUES('Administrators Financial Sector')


Robert R. Barnes
Go to Top of Page

RyanRandall
Master Smack Fu Yak Hacker

1074 Posts

Posted - 2006-05-08 : 05:30:00
Hi Robert,

This will work for the example you've pointed out (as I understand it), but I'm guessing it's not going to be sufficient for your whole problem. Could you post some more examples (i.e. ones this won't work for). Thanks.

select * from #TEMP a inner join #TEMP b on not a.tid = b.tid and a.Field1 = replace(b.Field1 + ' ', 's ', ' ')


Ryan Randall
www.monsoonmalabar.com London-based IT consultancy

Solutions are easy. Understanding the problem, now, that's the hard part.
Go to Top of Page

Lopaka
Starting Member

48 Posts

Posted - 2006-05-08 : 11:57:37
Your idea would work, if i were only looking for words with a trailing 's'. Unfortunately, the word could be a little more different.

I was thinking this weekend about using left 75% of each word in Field1. This would ensure a higher return on the observations. I have added a couple of more 's' words so now you get the 4 identical. I also added 'accounting Director' and 'Account Director'; using the 75% rule, this would return as a duplicate.

Also, sorry about the example data. I am not very good at making stuff up. :(

CREATE TABLE #TEMP(
tID INT IDENTITY(1, 1) NOT NULL ,
Field1 CHAR(50) NOT NULL)

INSERT INTO #TEMP(Field1) VALUES('Administrator Financial Sector')
INSERT INTO #TEMP(Field1) VALUES('Account Planner')
INSERT INTO #TEMP(Field1) VALUES('Account Support Specialist')
INSERT INTO #TEMP(Field1) VALUES('Accountant')
INSERT INTO #TEMP(Field1) VALUES('Accountant 4')
INSERT INTO #TEMP(Field1) VALUES('Accountant Attorney')
INSERT INTO #TEMP(Field1) VALUES('Accountant Cost & Management')
INSERT INTO #TEMP(Field1) VALUES('Accountant General 1 Entry')
INSERT INTO #TEMP(Field1) VALUES('Accountant Municipal')
INSERT INTO #TEMP(Field1) VALUES('Accountant Public')
INSERT INTO #TEMP(Field1) VALUES('Accounting Auditing Clerk')
INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk 4')
INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk Entry Level')
INSERT INTO #TEMP(Field1) VALUES('Accounting Consultant')
INSERT INTO #TEMP(Field1) VALUES('Accounting Consultant2')
INSERT INTO #TEMP(Field1) VALUES('Accounting Director')
INSERT INTO #TEMP(Field1) VALUES('Account Director')
INSERT INTO #TEMP(Field1) VALUES('Accounting Machine Servicer')
INSERT INTO #TEMP(Field1) VALUES('Accounting Section Supervisor')
INSERT INTO #TEMP(Field1) VALUES('Accounting Sections Supervisor')
INSERT INTO #TEMP(Field1) VALUES('Accounting Sections Supervisors')
INSERT INTO #TEMP(Field1) VALUES('Accounting Teacher')
INSERT INTO #TEMP(Field1) VALUES('Accounting Team Leader')
INSERT INTO #TEMP(Field1) VALUES('Accounting Technician')
INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Clerk')
INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Manager')
INSERT INTO #TEMP(Field1) VALUES('Administrators Financial Sector')


Robert R. Barnes
Go to Top of Page

RyanRandall
Master Smack Fu Yak Hacker

1074 Posts

Posted - 2006-05-08 : 12:49:56
Okay, here's the next 'stab'. I've stuck to the rough algorithm you've suggested and so you can at least get to grips with one possible technique.

Again please let us know any issues with this approach and give examples of where it doesn't give what you need. This approach will definitely not be quick, but let's not worry about that just yet.

One possible alternative approach is to do something with the built-in full-text searching. I'm not sure I've got time to play around with that for you though.

--data
if object_id('tempdb.dbo.#TEMP') is not null drop table #TEMP
CREATE TABLE #TEMP(tID INT IDENTITY(1, 1) NOT NULL, Field1 CHAR(50) NOT NULL)

INSERT INTO #TEMP(Field1) VALUES('Administrator Financial Sector')
INSERT INTO #TEMP(Field1) VALUES('Account Planner')
INSERT INTO #TEMP(Field1) VALUES('Account Support Specialist')
INSERT INTO #TEMP(Field1) VALUES('Accountant')
INSERT INTO #TEMP(Field1) VALUES('Accountant 4')
INSERT INTO #TEMP(Field1) VALUES('Accountant Attorney')
INSERT INTO #TEMP(Field1) VALUES('Accountant Cost & Management')
INSERT INTO #TEMP(Field1) VALUES('Accountant General 1 Entry')
INSERT INTO #TEMP(Field1) VALUES('Accountant Municipal')
INSERT INTO #TEMP(Field1) VALUES('Accountant Public')
INSERT INTO #TEMP(Field1) VALUES('Accounting Auditing Clerk')
INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk 4')
INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk Entry Level')
INSERT INTO #TEMP(Field1) VALUES('Accounting Consultant')
INSERT INTO #TEMP(Field1) VALUES('Accounting Consultant2')
INSERT INTO #TEMP(Field1) VALUES('Accounting Director')
INSERT INTO #TEMP(Field1) VALUES('Account Director')
INSERT INTO #TEMP(Field1) VALUES('Accounting Machine Servicer')
INSERT INTO #TEMP(Field1) VALUES('Accounting Section Supervisor')
INSERT INTO #TEMP(Field1) VALUES('Accounting Sections Supervisor')
INSERT INTO #TEMP(Field1) VALUES('Accounting Sections Supervisors')
INSERT INTO #TEMP(Field1) VALUES('Accounting Teacher')
INSERT INTO #TEMP(Field1) VALUES('Accounting Team Leader')
INSERT INTO #TEMP(Field1) VALUES('Accounting Technician')
INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Clerk')
INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Manager')
INSERT INTO #TEMP(Field1) VALUES('Administrators Financial Sector')
go

--calculation

--split words out into extra columns

alter table #TEMP add w1 varchar(20), w2 varchar(20), w3 varchar(20), w4 varchar(20), w5 varchar(20)
go

declare @i1 int, @i2 int, @i3 int, @i4 int, @i5 int


update #TEMP set
@i1 = charindex(' ', Field1),
w1 = left(Field1, @i1-1),
@i2 = charindex(' ',Field1,@i1+1),
w2 = substring(Field1, @i1+1, @i2-@i1-1),
@i3 = charindex(' ',Field1, @i2+1),
w3 = substring(Field1, @i2+1, @i3-@i2-1),
@i4 = charindex(' ',Field1, @i3+1),
w4 = substring(Field1, @i3+1, @i4-@i3-1),
w5 = substring(Field1, @i4+1, 50)

--truncate the words
update #TEMP set
w1 = nullif(left(w1, 0.75 * len(w1)), ''),
w2 = nullif(left(w2, 0.75 * len(w2)), ''),
w3 = nullif(left(w3, 0.75 * len(w3)), ''),
w4 = nullif(left(w4, 0.75 * len(w4)), ''),
w5 = nullif(left(w5, 0.75 * len(w5)), '')

--compare truncated words
select * from #TEMP a inner join #TEMP b on
a.tid < b.tid and
(a.w1 like b.w1 + '%' or b.w1 like a.w1 + '%' or (a.w1 is null and b.w1 is null)) and
(a.w2 like b.w2 + '%' or b.w2 like a.w2 + '%' or (a.w2 is null and b.w2 is null)) and
(a.w3 like b.w3 + '%' or b.w3 like a.w3 + '%' or (a.w3 is null and b.w3 is null)) and
(a.w4 like b.w4 + '%' or b.w4 like a.w4 + '%' or (a.w4 is null and b.w4 is null)) and
(a.w5 like b.w5 + '%' or b.w5 like a.w5 + '%' or (a.w5 is null and b.w5 is null))


Ryan Randall
www.monsoonmalabar.com London-based IT consultancy

Solutions are easy. Understanding the problem, now, that's the hard part.
Go to Top of Page
   

- Advertisement -