 searchable pdf database


stowaway
Starting Member

7 Posts

Posted - 2010-12-14 : 01:50:06
So I have a database in which I store PDFs as varbinary. (This has already been created. I know I probably should have just stored a reference and kept the files on the file system. Anyhow.)

Before I scan my PDFs, I put them through commercial OCR software, which turns them into searchable PDFs.

My question is: how should I get that text and put it into the database so that I can search through the text of the PDFs? Can someone point me in the right direction?

PS: commercial software isn't out of the question as long as it's not over $1000.

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-12-14 : 02:53:44
Dump the text into varchar(max) and use full text search?

--
Gail Shaw
SQL Server MVP
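
A minimal sketch of what the suggestion above could look like in T-SQL. The table, column, catalog, and constraint names (dbo.Documents, PdfText, DocumentCatalog, PK_Documents) are hypothetical and not from the thread, and it assumes the Full-Text Search feature is installed on the instance and that the extracted text is English.

```sql
-- Hypothetical table: the PDF stays in varbinary(max), the OCR text goes in varchar(max).
CREATE TABLE dbo.Documents
(
    DocumentId INT IDENTITY(1,1) NOT NULL,
    PdfData    VARBINARY(MAX)    NOT NULL,  -- the stored PDF file
    PdfText    VARCHAR(MAX)      NULL,      -- text extracted from the searchable PDF
    CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (DocumentId)
);
GO

-- A full-text index lives in a full-text catalog.
CREATE FULLTEXT CATALOG DocumentCatalog AS DEFAULT;
GO

-- The KEY INDEX must be a unique, single-column, non-nullable index on the table.
CREATE FULLTEXT INDEX ON dbo.Documents (PdfText LANGUAGE 1033)  -- 1033 = US English
    KEY INDEX PK_Documents
    ON DocumentCatalog
    WITH CHANGE_TRACKING AUTO;  -- keep the index up to date as rows change
GO

-- Find documents whose text contains the word "invoice".
SELECT DocumentId
FROM dbo.Documents
WHERE CONTAINS(PdfText, N'invoice');
```

The full-text index is populated in the background, so newly inserted rows may not be searchable immediately. CONTAINS matches whole words (or prefixes), not arbitrary substrings.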

stowaway
Starting Member

7 Posts

Posted - 2010-12-14 : 05:34:10
quote:
Originally posted by GilaMonster

Dump the text into varchar(max) and use full text search?

I was thinking that. However, that'll slow it down a lot, won't it?
Surely there must be a better way to index it?

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-12-14 : 05:39:55
You can roll your own full text search if you want. Split the text into keywords, put each keyword into a keywords table (one word per row) and write your own queries to do the matches, but that's reinventing the wheel.

Searching through large chunks of unstructured text is what full text search is good at. Normal SQL indexes fail badly at that.

There are likely third-party tools available; which ones and how much they cost, I can't say.

P.S. I assume you don't have a SharePoint installation?

--
Gail Shaw
SQL Server MVP
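
For illustration, the roll-your-own keyword table described above might look roughly like this. All names (dbo.DocumentKeywords, Keyword, DocumentId) are hypothetical, and the sketch assumes the loading application splits the extracted text into lowercase words before inserting them, since SQL Server 2008 has no built-in string-splitting function.

```sql
-- Hypothetical keyword table: one row per (word, document) pair.
CREATE TABLE dbo.DocumentKeywords
(
    Keyword    VARCHAR(100) NOT NULL,
    DocumentId INT          NOT NULL,
    -- Keyword leads the key, so an equality match on a word can seek.
    CONSTRAINT PK_DocumentKeywords PRIMARY KEY CLUSTERED (Keyword, DocumentId)
);
GO

-- Documents containing a single word.
SELECT DocumentId
FROM dbo.DocumentKeywords
WHERE Keyword = 'invoice';

-- Documents containing both words: keep only the documents that matched
-- every keyword in the list.
SELECT DocumentId
FROM dbo.DocumentKeywords
WHERE Keyword IN ('invoice', 'overdue')
GROUP BY DocumentId
HAVING COUNT(DISTINCT Keyword) = 2;
```

This handles exact word matches only. Phrases, word forms, stop words, and ranking would all have to be built by hand, which is the wheel that full text search already provides.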

Sachin.Nand

2937 Posts

Posted - 2010-12-14 : 06:05:19
If your third-party software creates physical PDF files, then you can try the Windows Indexing Service, which will search for specific text in the folder where the files are created.


PBUH


stowaway
Starting Member

7 Posts

Posted - 2010-12-14 : 06:41:03
The files are kept in a database, so I don't think I can use the Windows Indexing Service.

It's fairly easy for me to dump the text into a varchar(max) and then just do a select * from blah where text like '%whatever%'

That's simple.

I could be overthinking it, but I thought it'd be slow, given that there could possibly be 2000 tuples to search through
(2000 with a large amount of text in each).

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-12-14 : 07:04:52
That'll be dead slow. Absolutely dead slow. It'll be a table scan each time since standard SQL indexes cannot seek when the operation is a like with a leading wildcard.

That's why I suggested full text indexing which is designed for exactly this kind of searching. Since you don't seem to be in favour of full text indexing, I won't post any further here.

--
Gail Shaw
SQL Server MVP
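
A small sketch of the leading-wildcard point, using a hypothetical dbo.DocumentTitles table with an ordinary index on a short Title column (neither is from the thread). A varchar(max) column cannot be an index key column at all, so the comparison is shown on a varchar(200) column instead.

```sql
-- Hypothetical table with a regular nonclustered index on a short text column.
CREATE TABLE dbo.DocumentTitles
(
    DocumentId INT          NOT NULL PRIMARY KEY,
    Title      VARCHAR(200) NOT NULL
);

CREATE NONCLUSTERED INDEX IX_DocumentTitles_Title
    ON dbo.DocumentTitles (Title);
GO

-- Trailing wildcard only: the pattern has a fixed prefix, so the optimizer
-- can seek to the range of index entries that start with 'invoice'.
SELECT DocumentId
FROM dbo.DocumentTitles
WHERE Title LIKE 'invoice%';

-- Leading wildcard: there is no fixed prefix to seek on, so every row
-- has to be read and tested against the pattern.
SELECT DocumentId
FROM dbo.DocumentTitles
WHERE Title LIKE '%invoice%';
```

Comparing the actual execution plans of the two queries in Management Studio is one way to see the difference between the seek and the scan.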

Sachin.Nand

2937 Posts

Posted - 2010-12-14 : 07:10:21
quote:
The files are kept in a database, so I don't think I can use the Windows Indexing Service.


You have to enable it on the database server. I faced the same issue. The Windows Indexing Service returned results in less than a second across about 8000 files.

PBUH


stowaway
Starting Member

7 Posts

Posted - 2010-12-14 : 15:45:40
quote:
Originally posted by GilaMonster

That'll be dead slow. Absolutely dead slow. It'll be a table scan each time since standard SQL indexes cannot seek when the operation is a like with a leading wildcard.

That's why I suggested full text indexing which is designed for exactly this kind of searching. Since you don't seem to be in favour of full text indexing, I won't post any further here.

--
Gail Shaw
SQL Server MVP



I'm not against full text indexing. I'm just a SQL noob and didn't realise that was the solution you were offering.
I shall do some Google research into it.
But you are saying that I chuck the text into a varchar(max) and there is a way to index and search it efficiently?

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-12-15 : 00:50:45
http://msdn.microsoft.com/en-us/library/ms142571%28v=SQL.100%29.aspx

--
Gail Shaw
SQL Server MVP
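
Once a full-text index exists, queries against it might look like the sketch below. It assumes a hypothetical dbo.Documents table with an integer key DocumentId and a full-text indexed PdfText column; those names and the search terms are illustrative, not from the thread.

```sql
-- Exact phrase.
SELECT DocumentId
FROM dbo.Documents
WHERE CONTAINS(PdfText, N'"purchase order"');

-- Prefix match: words starting with "invoic" (invoice, invoices, invoicing).
SELECT DocumentId
FROM dbo.Documents
WHERE CONTAINS(PdfText, N'"invoic*"');

-- Boolean combination of terms.
SELECT DocumentId
FROM dbo.Documents
WHERE CONTAINS(PdfText, N'invoice AND overdue');

-- FREETEXT matches on meaning rather than exact wording
-- (it also considers inflected forms of the words).
SELECT DocumentId
FROM dbo.Documents
WHERE FREETEXT(PdfText, N'overdue invoices');

-- CONTAINSTABLE returns a relevance rank, so results can be ordered by it.
SELECT d.DocumentId, ft.[RANK]
FROM CONTAINSTABLE(dbo.Documents, PdfText, N'invoice') AS ft
JOIN dbo.Documents AS d
    ON d.DocumentId = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```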
   
