Author |
Topic |
stowaway
Starting Member
7 Posts |
Posted - 2010-12-14 : 01:50:06
|
So I have a database in which I store PDFs as varbinary. (This has already been created. I know I probably should have just stored a reference and kept the files on the file system, but anyhow.) Before I scan my PDFs I put them through commercial OCR software, which turns them into searchable PDFs. My question is: how should I get that text and put it into the database so that I can search through the text of the PDFs? Can someone point me in the right direction? PS: commercial software isn't out of the question as long as it's not over $1000. |
|
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-12-14 : 02:53:44
|
Dump the text into varchar(max) and use full text search?
--
Gail Shaw
SQL Server MVP |
 |
|
stowaway
Starting Member
7 Posts |
Posted - 2010-12-14 : 05:34:10
|
quote: Originally posted by GilaMonster Dump the text into varchar(max) and use full text search?
I was thinking that. However, that'll slow it down a lot, won't it? Surely there must be a better way to index it? |
 |
|
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-12-14 : 05:39:55
|
You can roll your own full text search if you want. Split the text into keywords, put each keyword into a keywords table (one word per row) and write your own queries to do the matches, but that's reinventing the wheel. Searching through large chunks of unstructured text is what full text search is good at. Normal SQL indexes fail badly at that. There are likely 3rd party tools available; which ones and how much they cost I can't answer. P.S. I assume you don't have a SharePoint installation?
--
Gail Shaw
SQL Server MVP |
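A minimal sketch of the roll-your-own keyword table described above, in T-SQL. The table and column names (dbo.DocumentKeywords, DocumentId, Keyword) are hypothetical, and the sketch assumes the OCR text has already been split into lowercase words by application code before the rows are inserted.

```sql
-- Hypothetical keyword table: one row per (word, document) pair.
-- Clustering on Keyword first lets an equality search seek instead of scan.
CREATE TABLE dbo.DocumentKeywords
(
    DocumentId int          NOT NULL,
    Keyword    varchar(100) NOT NULL,
    CONSTRAINT PK_DocumentKeywords PRIMARY KEY CLUSTERED (Keyword, DocumentId)
);

-- Sample rows, assumed to come from splitting each document's text into words.
INSERT INTO dbo.DocumentKeywords (DocumentId, Keyword)
VALUES (1, 'invoice'), (1, 'overdue'), (2, 'invoice'), (2, 'paid');

-- Find documents that contain BOTH words.
SELECT k.DocumentId
FROM dbo.DocumentKeywords AS k
WHERE k.Keyword IN ('invoice', 'overdue')
GROUP BY k.DocumentId
HAVING COUNT(DISTINCT k.Keyword) = 2;
```

This only handles exact word matches. Stemming, phrase search, noise words and ranking would all have to be written by hand, which is the wheel that full text search already provides.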
 |
|
Sachin.Nand
2937 Posts |
Posted - 2010-12-14 : 06:05:19
|
If your third-party software creates physical PDF files, then you can try the Windows Indexing Service, which will search for the specific text in the folder where the files are being created.
PBUH |
 |
|
stowaway
Starting Member
7 Posts |
Posted - 2010-12-14 : 06:41:03
|
The files are kept in a database, so I don't think I can use the Windows Indexing Service. It's fairly easy for me to dump the text into a varchar(max) and then just do a select * from blah where text like '%whatever%'. That's simple. I could be overthinking it, but I thought it'd be slow, given that there could possibly be 2000 tuples to search through (2000 with a large amount of text in each). |
 |
|
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-12-14 : 07:04:52
|
That'll be dead slow. Absolutely dead slow. It'll be a table scan each time, since standard SQL indexes cannot seek when the operation is a LIKE with a leading wildcard. That's why I suggested full text indexing, which is designed for exactly this kind of searching. Since you don't seem to be in favour of full text indexing, I won't post any further here.
--
Gail Shaw
SQL Server MVP |
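As a rough illustration of the difference, here is a sketch in T-SQL of the LIKE query next to a full text index on the extracted text. The table, column, index and catalog names (dbo.Documents, PdfContent, ExtractedText, PK_Documents, DocumentCatalog) are assumptions made for the example, not the poster's actual schema, and language 1033 (English) is likewise assumed.

```sql
-- Hypothetical table: the PDF bytes plus the OCR text dumped into varchar(max).
CREATE TABLE dbo.Documents
(
    DocumentId    int            NOT NULL CONSTRAINT PK_Documents PRIMARY KEY,
    PdfContent    varbinary(max) NOT NULL,
    ExtractedText varchar(max)   NULL
);

-- A full text index needs a catalog and a unique, single-column,
-- non-nullable key index (here the primary key).
CREATE FULLTEXT CATALOG DocumentCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.Documents (ExtractedText LANGUAGE 1033)
    KEY INDEX PK_Documents
    ON DocumentCatalog
    WITH CHANGE_TRACKING AUTO;

-- Leading wildcard: every row's text has to be read and scanned.
SELECT DocumentId
FROM dbo.Documents
WHERE ExtractedText LIKE '%whatever%';

-- Full text query: answered from the full text index.
SELECT DocumentId
FROM dbo.Documents
WHERE CONTAINS(ExtractedText, 'whatever');
```

The two queries are not exact equivalents: CONTAINS matches whole words (or prefixes, when written as '"what*"'), not arbitrary substrings. The index is also populated in the background, so a query run immediately after creating it may return nothing until population finishes.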
 |
|
Sachin.Nand
2937 Posts |
Posted - 2010-12-14 : 07:10:21
|
quote: The files are kept in a database, so I don't think I can use the Windows Indexing Service.
You have to enable it on the database server. I had faced the same issue. The Windows Indexing Service returned the results in less than a second across about 8000 files.
PBUH |
 |
|
stowaway
Starting Member
7 Posts |
Posted - 2010-12-14 : 15:45:40
|
quote: Originally posted by GilaMonster That'll be dead slow. Absolutely dead slow. It'll be a table scan each time, since standard SQL indexes cannot seek when the operation is a LIKE with a leading wildcard. That's why I suggested full text indexing, which is designed for exactly this kind of searching. Since you don't seem to be in favour of full text indexing, I won't post any further here.
I'm not against full text indexing. I'm just a SQL noob and didn't realise what solution you were offering. I shall do some Google research into it. But you are saying that I chuck the text into a varchar(max) and there is a way to index and search it efficiently? |
 |
|
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-12-15 : 00:50:45
|
http://msdn.microsoft.com/en-us/library/ms142571%28v=SQL.100%29.aspx
--
Gail Shaw
SQL Server MVP |
 |
|
|