Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 The bottom of the barrel


elwoos
Master Smack Fu Yak Hacker

2052 Posts

Posted - 2005-07-01 : 18:20:37
Is anyone familiar with web/data scraping? If I've understood correctly, it's a way of pulling data from a website or application into a database. So, for example, it would be useful for me to pull data from those websites I need that let you search their data but not download it. Is this correct? How does it work?


cheers

steve

Alright Brain, you don't like me, and I don't like you. But lets just do this, and I can get back to killing you with beer.

Merkin
Funky Drop Bear Fearing SQL Dude!

4970 Posts

Posted - 2005-07-01 : 21:40:40
Your definition is correct. Screen scraping is grabbing the HTML from a site, parsing out the useful bits, and doing something with them.

You can write something to grab the HTML in .NET, or use a tool like wget to save all the HTML locally and work with it there.

The term is pretty loose, depending on what you need to do.
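In case a concrete illustration helps: the fetch-parse-use loop described above can be sketched in a few lines. This is a Python stdlib sketch rather than the .NET or wget route (purely illustrative; the URL you feed it is your own):

```python
# Minimal screen-scraping sketch: fetch a page's HTML, parse out the
# useful bits (here, link targets), and return them for further use.
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags as the parser walks the HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def scrape_links(url):
    """Grab the HTML and pull out every link on the page."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

The same parser can be fed HTML that wget saved locally, which avoids re-fetching the site while you tune the parsing.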


Damian
"A foolish consistency is the hobgoblin of little minds." - Emerson

spirit1
Cybernetic Yak Master

11752 Posts

Posted - 2005-07-02 : 03:45:07
We used that with Babelfish for crude translations... I think they have a web service now...

Go with the flow & have fun! Else fight the flow

Kristen
Test

22859 Posts

Posted - 2005-07-02 : 04:49:54
Problem is, when they change their site your scraper no longer works, and if it's something everyone has become dependent on, it becomes an emergency to fix it! So you might want to use a tool that allows some sort of fuzzy matching, templated "finds", and so on, so that it's easy to adjust when Version 2 of the Useful Data Site comes out.

I've done it before using Perl; there's a fairly handy HTTP library for it. You then discover that sites need cookies to maintain session state, and maybe a login too, so there's a fair amount to build for a Real World System. Horses for courses.
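The cookie/login plumbing can be sketched with the standard library too (Python here instead of Perl; the URLs and form field names below are hypothetical placeholders, not any real site's API):

```python
# Sketch of a scraping "session": a cookie jar shared across requests,
# so a login POST's session cookie is sent on later page fetches.
import urllib.request
from http.cookiejar import CookieJar
from urllib.parse import urlencode

def make_session():
    """Build an opener that remembers cookies between requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

def login_and_fetch(opener, login_url, data_url, credentials):
    """POST a (hypothetical) login form, then fetch a page as that user."""
    form = urlencode(credentials).encode("ascii")
    opener.open(login_url, form)          # server sets the session cookie
    with opener.open(data_url) as resp:   # cookie is sent automatically
        return resp.read()
```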

Kristen

elwoos
Master Smack Fu Yak Hacker

2052 Posts

Posted - 2005-07-02 : 14:52:22
Thanks for that, guys. The sort of thing I had in mind is similar to what Spirit mentioned. One of the problems we have is that I am trying to impose the use of national datasets that can be found on the web, but I can't always get the dataset itself. My biggest concern was what happens when the site changes, as Kristen mentioned. I don't know Perl at all, but .NET is a possibility. I suppose the other problem is parsing the results to get what you need (in my case, to put into my database).

cheers

steve


Thrasymachus
Constraint Violating Yak Guru

483 Posts

Posted - 2005-07-05 : 16:16:04
Are there some legal issues here?

Sean Roussy

Please backup all of your databases including master, msdb and model on a regular basis. I am tired of telling people they are screwed. The job you save may be your own.

I am available for consulting work. Just email me though the forum.

Kristen
Test

22859 Posts

Posted - 2005-07-05 : 22:01:24
Steve works for a government agency; I'm sure he's got that covered up.

Kristen

Thrasymachus
Constraint Violating Yak Guru

483 Posts

Posted - 2005-07-06 : 11:38:24
That is even scarier.

Sean Roussy


Sitka
Aged Yak Warrior

571 Posts

Posted - 2005-07-06 : 12:42:10
Going through this (kind of) right now...

We have a CAD application that can generate an HTML page containing a single table, which represents the Bill of Materials. It can also produce a CSV, but that was really ugly (it jumped lines to handle over-long descriptions), so I decided to build an app that reads the HTML as a stream, gets rid of the markup (a simple Replace), and makes a better, cleaner CSV. I think I am then going to use the Microsoft Text Driver ODBC to populate a DataTable; I'm just working on that now. Once it's in a DataTable, I'll use ADO.NET to get the latest BOM into an audit table for processing and evaluation, and to automate some of the purchasing area of an ERP application.
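For what it's worth, the markup-stripping step can also be done with a real parser rather than string Replaces, which copes better with attributes and stray whitespace. A stdlib sketch of HTML-table-to-CSV (Python purely for illustration; the actual app described here is .NET):

```python
# Sketch: parse the <tr>/<td> cells of an HTML table into rows, then
# write them out as a clean CSV.
import csv
from html.parser import HTMLParser

class TableToRows(HTMLParser):
    """Accumulate table cells into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True
            self.row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data.strip()

def html_table_to_csv(html_text, csv_path):
    """Parse the page's table and write it as CSV; returns the rows."""
    parser = TableToRows()
    parser.feed(html_text)
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(parser.rows)
    return parser.rows
```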

XML kind of fails in my hands at this point, because any XML still needs a schema applied to convert the positional/serialized <tr><td> sets into proper complex elements. Excel 2003 can do nice mapping for you, but I don't think the .NET interop for Excel 11(?) exposes everything needed to drop the table into a sheet, apply an XML map, and then save as XML. If it could, you could build the DataTable from the XML. Freaking XML, nothing good to say about it yet.

So while it isn't really scraping, because the HTML comes from a network path rather than a URL, it's the same idea.

Here is one example: http://www.eggheadcafe.com/articles/20040603.asp. There are a few thousand bits and pieces out there, but nothing really clear, as almost every search leads you in the direction of XML to HTML, not the other way around.

I wish someone would start an Official XML Rant Thread.

elwoos
Master Smack Fu Yak Hacker

2052 Posts

Posted - 2005-07-06 : 18:26:33
Please excuse me if I'm teaching my granny to suck eggs, but I'm halfway through an ASP.NET course this week and the trainer is trying to convince me that XML is a good thing (I've always been sceptical about it so far). Anyway, the point I am trying to make, albeit in a roundabout way, is that I think you can pull the (XML) data you are talking about into a DataSet and then manipulate it to your heart's content.

If anyone has an app where holding data in an XML format is the most efficient way of storing it, I would be interested (and I don't count anything that merely allows you to present data differently via stylesheets).

steve




Kristen
Test

22859 Posts

Posted - 2005-07-07 : 11:06:35
Converting a recordset from a database into XML for local caching might be a valid example.

But other than that, we generally only use it for point-to-point data transfer. It's too blinking slow for normal storage needs, what with all the parsing and so on associated with using XML.

Kristen

spirit1
Cybernetic Yak Master

11752 Posts

Posted - 2005-07-07 : 11:34:51
We use an XPDL form of XML. It's a format used to define each stage of a chain, and it's really useful: you just set the order of the modules to be executed and that's it. Of course, you have to build each module.
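A toy sketch of that idea: an XML document names the stages, and a driver runs them in document order. (Python for illustration; the element names and modules below are invented for the sketch, not real XPDL.)

```python
# Sketch of an XML-driven module chain: the XML decides which stages
# run and in what order; each "module" here is just a function.
import xml.etree.ElementTree as ET

MODULES = {
    "fetch":     lambda data: data + ["fetched"],
    "translate": lambda data: data + ["translated"],
    "store":     lambda data: data + ["stored"],
}

def run_chain(xml_text, data=None):
    """Execute the modules named in the XML, in document order."""
    chain = ET.fromstring(xml_text)
    result = [] if data is None else data
    for stage in chain.findall("stage"):
        result = MODULES[stage.get("module")](result)
    return result
```

Reordering the chain, or dropping a stage, is then just an edit to the XML, with no code change.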


elwoos
Master Smack Fu Yak Hacker

2052 Posts

Posted - 2005-07-07 : 14:14:46
We looked at caching today. I can see some use for that, but of course it doesn't have to be XML; it could be anything you can put in a DataSet. Perhaps, Kristen, you don't need to convert.

steve


Kristen
Test

22859 Posts

Posted - 2005-07-07 : 14:36:40
XML is kinda handy, though: out of the box it can encapsulate all the aspects of a SQL recordset that matter for a persistent container. So today you can get a recordset from SQL; tomorrow you can "fake" getting it from the database using a stored XML copy instead, and the application will be none the wiser that the data didn't actually come from the DB.

We have a function that calls SQL: you pass it the Sproc name, the parameters, etc. The function checks a list of the names of Sprocs that return persistent data (i.e. for a given set of parameters the Sproc will always return the same data). If the Sproc name is in the list, it checks an in-memory index for the parameters, and if it finds an exact match it pulls the data from the cache instead, converts the XML to a recordset, and presents that to the caller.

So if we create a new Sproc that returns persistent data, we just add its name to the cached-Sproc name list, and thereafter that Sproc is served from the cache ...

... purging the cache when the "persistent data" changes, and when memory is full, is left as an exercise for the reader!
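That flow can be sketched roughly as follows (Python purely for illustration; `run_sproc`, the sproc name, and the XML shape are stand-ins, not actual ADO or SQL Server APIs):

```python
# Sketch of a sproc-result cache: results for sprocs known to return
# persistent data are stored as XML, keyed by (sproc name, parameters);
# on a hit, the rows are rebuilt from XML with no database round trip.
import xml.etree.ElementTree as ET

CACHEABLE_SPROCS = {"GetCountryList"}   # hypothetical "persistent data" sproc
_cache = {}                             # (name, params) -> XML string

def rows_to_xml(rows):
    """Serialise a list of row-dicts to an XML string."""
    root = ET.Element("recordset")
    for row in rows:
        el = ET.SubElement(root, "row")
        for key, value in row.items():
            el.set(key, str(value))
    return ET.tostring(root, encoding="unicode")

def xml_to_rows(xml_text):
    """Rebuild the row-dicts from the stored XML."""
    return [dict(el.attrib) for el in ET.fromstring(xml_text)]

def call_sproc(name, params, run_sproc):
    """run_sproc(name, params) is the real database call."""
    key = (name, tuple(sorted(params.items())))
    if name in CACHEABLE_SPROCS and key in _cache:
        return xml_to_rows(_cache[key])     # cache hit: skip the DB
    rows = run_sproc(name, params)
    if name in CACHEABLE_SPROCS:
        _cache[key] = rows_to_xml(rows)
    return rows
```

Purging on data change or memory pressure is, as noted above, the hard part and isn't shown.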

Kristen
   
