Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
| Author |
Topic |
|
Lopaka
Starting Member
48 Posts |
Posted - 2006-05-05 : 19:44:03
|
| I am trying to take a field and check to see if there are duplicate data. In the following script, #temp gets created and data gets inserted. There are two fields identicaly input, "Administrators Financial Sector". One with an 's' and one without. I am trying to compare the first 80% of the word(s), for example: "Administrat", "Financi", "Sect". Then compares all three words to the rest of the records and returns both records. Keep in mind this is going thru ~100K records.Does anybody have an idea?CREATE TABLE #TEMP( tID INT IDENTITY(1, 1) NOT NULL , Field1 CHAR(50) NOT NULL)INSERT INTO #TEMP(Field1) VALUES('Account Administrator Financial Sector')INSERT INTO #TEMP(Field1) VALUES('Account Planner')INSERT INTO #TEMP(Field1) VALUES('Account Support Specialist')INSERT INTO #TEMP(Field1) VALUES('Accountant')INSERT INTO #TEMP(Field1) VALUES('Accountant 4')INSERT INTO #TEMP(Field1) VALUES('Accountant Attorney')INSERT INTO #TEMP(Field1) VALUES('Accountant Cost & Management')INSERT INTO #TEMP(Field1) VALUES('Accountant General 1 Entry')INSERT INTO #TEMP(Field1) VALUES('Accountant Municipal')INSERT INTO #TEMP(Field1) VALUES('Accountant Public')INSERT INTO #TEMP(Field1) VALUES('Accounting Auditing Clerk')INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk 4')INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk Entry Level')INSERT INTO #TEMP(Field1) VALUES('Accounting Consultant')INSERT INTO #TEMP(Field1) VALUES('Accounting Director')INSERT INTO #TEMP(Field1) VALUES('Accounting Machine Servicer')INSERT INTO #TEMP(Field1) VALUES('Accounting Section Supervisor')INSERT INTO #TEMP(Field1) VALUES('Accounting Teacher')INSERT INTO #TEMP(Field1) VALUES('Accounting Team Leader')INSERT INTO #TEMP(Field1) VALUES('Accounting Technician')INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Clerk')INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Manager')INSERT INTO #TEMP(Field1) VALUES('Administrators Financial Sector')Robert R. Barnes |
|
|
khtan
In (Som, Ni, Yak)
17689 Posts |
Posted - 2006-05-05 : 21:36:32
|
Will this do ?select *from #TEMP t1 inner join #TEMP t2on t1.Field1 like left(t2.Field1, 0.8 * len(t2.Field1)) + '%'and t1.Field1 <> t2.Field1 KH |
 |
|
|
Lopaka
Starting Member
48 Posts |
Posted - 2006-05-06 : 00:02:18
|
| This returns all fields with one word identical. The object would be to return on the 'Administrator Financial Sector' and 'Administrators Financial Sector'. Unfortunately, I just noticed the first has accountant in it. Need to use the following:CREATE TABLE #TEMP(tID INT IDENTITY(1, 1) NOT NULL ,Field1 CHAR(50) NOT NULL)INSERT INTO #TEMP(Field1) VALUES('Administrator Financial Sector')INSERT INTO #TEMP(Field1) VALUES('Account Planner')INSERT INTO #TEMP(Field1) VALUES('Account Support Specialist')INSERT INTO #TEMP(Field1) VALUES('Accountant')INSERT INTO #TEMP(Field1) VALUES('Accountant 4')INSERT INTO #TEMP(Field1) VALUES('Accountant Attorney')INSERT INTO #TEMP(Field1) VALUES('Accountant Cost & Management')INSERT INTO #TEMP(Field1) VALUES('Accountant General 1 Entry')INSERT INTO #TEMP(Field1) VALUES('Accountant Municipal')INSERT INTO #TEMP(Field1) VALUES('Accountant Public')INSERT INTO #TEMP(Field1) VALUES('Accounting Auditing Clerk')INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk 4')INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk Entry Level')INSERT INTO #TEMP(Field1) VALUES('Accounting Consultant')INSERT INTO #TEMP(Field1) VALUES('Accounting Director')INSERT INTO #TEMP(Field1) VALUES('Accounting Machine Servicer')INSERT INTO #TEMP(Field1) VALUES('Accounting Section Supervisor')INSERT INTO #TEMP(Field1) VALUES('Accounting Teacher')INSERT INTO #TEMP(Field1) VALUES('Accounting Team Leader')INSERT INTO #TEMP(Field1) VALUES('Accounting Technician')INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Clerk')INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Manager')INSERT INTO #TEMP(Field1) VALUES('Administrators Financial Sector')Robert R. Barnes |
 |
|
|
RyanRandall
Master Smack Fu Yak Hacker
1074 Posts |
Posted - 2006-05-08 : 05:30:00
|
Hi Robert,This will work for the example you've pointed out (as I understand it), but I'm guessing it's not going to be sufficient for your whole problem. Could you post some more examples (i.e. ones this won't work for). Thanks.  select * from #TEMP a inner join #TEMP b on not a.tid = b.tid and a.Field1 = replace(b.Field1 + ' ', 's ', ' ') Ryan Randallwww.monsoonmalabar.com London-based IT consultancy Solutions are easy. Understanding the problem, now, that's the hard part. |
 |
|
|
Lopaka
Starting Member
48 Posts |
Posted - 2006-05-08 : 11:57:37
|
| Your idea would work, if i were only looking for words with a trailing 's'. Unfortunately, the word could be a little more different. I was thinking this weekend about using left 75% of each word in Field1. This would ensure a higher return on the observations. I have added a couple of more 's' words so now you get the 4 identical. I also added 'accounting Director' and 'Account Director'; using the 75% rule, this would return as a duplicate.Also, sorry about the example data. I am not very good at making stuff up. :(CREATE TABLE #TEMP(tID INT IDENTITY(1, 1) NOT NULL ,Field1 CHAR(50) NOT NULL)INSERT INTO #TEMP(Field1) VALUES('Administrator Financial Sector')INSERT INTO #TEMP(Field1) VALUES('Account Planner')INSERT INTO #TEMP(Field1) VALUES('Account Support Specialist')INSERT INTO #TEMP(Field1) VALUES('Accountant')INSERT INTO #TEMP(Field1) VALUES('Accountant 4')INSERT INTO #TEMP(Field1) VALUES('Accountant Attorney')INSERT INTO #TEMP(Field1) VALUES('Accountant Cost & Management')INSERT INTO #TEMP(Field1) VALUES('Accountant General 1 Entry')INSERT INTO #TEMP(Field1) VALUES('Accountant Municipal')INSERT INTO #TEMP(Field1) VALUES('Accountant Public')INSERT INTO #TEMP(Field1) VALUES('Accounting Auditing Clerk')INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk 4')INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk Entry Level')INSERT INTO #TEMP(Field1) VALUES('Accounting Consultant')INSERT INTO #TEMP(Field1) VALUES('Accounting Consultant2')INSERT INTO #TEMP(Field1) VALUES('Accounting Director')INSERT INTO #TEMP(Field1) VALUES('Account Director')INSERT INTO #TEMP(Field1) VALUES('Accounting Machine Servicer')INSERT INTO #TEMP(Field1) VALUES('Accounting Section Supervisor')INSERT INTO #TEMP(Field1) VALUES('Accounting Sections Supervisor')INSERT INTO #TEMP(Field1) VALUES('Accounting Sections Supervisors')INSERT INTO #TEMP(Field1) VALUES('Accounting Teacher')INSERT INTO #TEMP(Field1) VALUES('Accounting Team Leader')INSERT INTO #TEMP(Field1) VALUES('Accounting Technician')INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Clerk')INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Manager')INSERT INTO #TEMP(Field1) VALUES('Administrators Financial Sector')Robert R. Barnes |
 |
|
|
RyanRandall
Master Smack Fu Yak Hacker
1074 Posts |
Posted - 2006-05-08 : 12:49:56
|
Okay, here's the next 'stab'. I've stuck to the rough algorithm you've suggested and so you can at least get to grips with one possible technique. Again please let us know any issues with this approach and give examples of where it doesn't give what you need. This approach will definitely not be quick, but let's not worry about that just yet. One possible alternative approach is to do something with the built-in full-text searching. I'm not sure I've got time to play around with that for you though.--dataif object_id('tempdb.dbo.#TEMP') is not null drop table #TEMPCREATE TABLE #TEMP(tID INT IDENTITY(1, 1) NOT NULL, Field1 CHAR(50) NOT NULL)INSERT INTO #TEMP(Field1) VALUES('Administrator Financial Sector')INSERT INTO #TEMP(Field1) VALUES('Account Planner')INSERT INTO #TEMP(Field1) VALUES('Account Support Specialist')INSERT INTO #TEMP(Field1) VALUES('Accountant')INSERT INTO #TEMP(Field1) VALUES('Accountant 4')INSERT INTO #TEMP(Field1) VALUES('Accountant Attorney')INSERT INTO #TEMP(Field1) VALUES('Accountant Cost & Management')INSERT INTO #TEMP(Field1) VALUES('Accountant General 1 Entry')INSERT INTO #TEMP(Field1) VALUES('Accountant Municipal')INSERT INTO #TEMP(Field1) VALUES('Accountant Public')INSERT INTO #TEMP(Field1) VALUES('Accounting Auditing Clerk')INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk 4')INSERT INTO #TEMP(Field1) VALUES('Accounting Clerk Entry Level')INSERT INTO #TEMP(Field1) VALUES('Accounting Consultant')INSERT INTO #TEMP(Field1) VALUES('Accounting Consultant2')INSERT INTO #TEMP(Field1) VALUES('Accounting Director')INSERT INTO #TEMP(Field1) VALUES('Account Director')INSERT INTO #TEMP(Field1) VALUES('Accounting Machine Servicer')INSERT INTO #TEMP(Field1) VALUES('Accounting Section Supervisor')INSERT INTO #TEMP(Field1) VALUES('Accounting Sections Supervisor')INSERT INTO #TEMP(Field1) VALUES('Accounting Sections Supervisors')INSERT INTO #TEMP(Field1) VALUES('Accounting Teacher')INSERT INTO #TEMP(Field1) VALUES('Accounting Team Leader')INSERT INTO #TEMP(Field1) VALUES('Accounting Technician')INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Clerk')INSERT INTO #TEMP(Field1) VALUES('Accounts Payable & Receivable Manager')INSERT INTO #TEMP(Field1) VALUES('Administrators Financial Sector')go--calculation--split words out into extra columnsalter table #TEMP add w1 varchar(20), w2 varchar(20), w3 varchar(20), w4 varchar(20), w5 varchar(20)godeclare @i1 int, @i2 int, @i3 int, @i4 int, @i5 intupdate #TEMP set @i1 = charindex(' ', Field1), w1 = left(Field1, @i1-1), @i2 = charindex(' ',Field1,@i1+1), w2 = substring(Field1, @i1+1, @i2-@i1-1), @i3 = charindex(' ',Field1, @i2+1), w3 = substring(Field1, @i2+1, @i3-@i2-1), @i4 = charindex(' ',Field1, @i3+1), w4 = substring(Field1, @i3+1, @i4-@i3-1), w5 = substring(Field1, @i4+1, 50)--truncate the wordsupdate #TEMP set w1 = nullif(left(w1, 0.75 * len(w1)), ''), w2 = nullif(left(w2, 0.75 * len(w2)), ''), w3 = nullif(left(w3, 0.75 * len(w3)), ''), w4 = nullif(left(w4, 0.75 * len(w4)), ''), w5 = nullif(left(w5, 0.75 * len(w5)), '')--compare truncated wordsselect * from #TEMP a inner join #TEMP b on a.tid < b.tid and (a.w1 like b.w1 + '%' or b.w1 like a.w1 + '%' or (a.w1 is null and b.w1 is null)) and (a.w2 like b.w2 + '%' or b.w2 like a.w2 + '%' or (a.w2 is null and b.w2 is null)) and (a.w3 like b.w3 + '%' or b.w3 like a.w3 + '%' or (a.w3 is null and b.w3 is null)) and (a.w4 like b.w4 + '%' or b.w4 like a.w4 + '%' or (a.w4 is null and b.w4 is null)) and (a.w5 like b.w5 + '%' or b.w5 like a.w5 + '%' or (a.w5 is null and b.w5 is null))Ryan Randallwww.monsoonmalabar.com London-based IT consultancy Solutions are easy. Understanding the problem, now, that's the hard part. |
 |
|
|
|
|
|
|
|