[Chicago] web page content scraper

Ian Bicking ianb at colorstudy.com
Thu Apr 10 01:02:36 CEST 2008


Massimo Di Pierro wrote:
> I posted my toy screen scraper on
> 
>     http://mdp.cti.depaul.edu/examples/static/scraper.py

This link doesn't work for me.

> There seems to be a lot of expertise on the list on this topic, so  
> perhaps you can help make it better or just use it to make yours  
> better. Currently it correctly scrapes a Wikipedia page given two  
> examples, extracts all repeated tags, removes all text and replaces  
> it with "text", finds all links, and handles javascript properly.
> 
> It uses the LCS applied to symbols (tags) instead of characters,  
> something I suggested when Adrian gave us an excellent presentation  
> on this topic.
> 
> What's missing:
> 1) after it finds all links, which it does (except for links in css  
> and javascript), it should loop over them, download the images,  
> rename them, and rewrite the links.
> 2) It may introduce some spurious close tags. They have to be removed.
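For anyone following along, the LCS-over-tags idea reduces each example 
page to its sequence of tag tokens and diffs those sequences instead of 
the raw text.  A minimal sketch of that step (my own illustration, not 
Massimo's code) might look like this:

import re

def tag_tokens(html):
    """Reduce a page to its sequence of tag tokens; the leading slash
    is kept so open and close tags stay distinct."""
    return re.findall(r'</?[a-zA-Z][a-zA-Z0-9]*', html)

def lcs(a, b):
    """Classic dynamic-programming longest common subsequence."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk the table backwards to recover the shared tag sequence.
    out, i, j = [], len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

page_one = '<html><body><h1>A</h1><p>x</p><p>y</p></body></html>'
page_two = '<html><body><h1>B</h1><p>z</p></body></html>'
# Prints the tag skeleton the two example pages have in common, which
# is what gets treated as the template.
print lcs(tag_tokens(page_one), tag_tokens(page_two))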

The project I worked on during Code-for-a-Cause is probably related to 
this: http://code.google.com/p/scrapy/

I was trying to finish it up when App Engine got me distracted.  I just 
committed the incomplete web interface, which is meant to be a kind of 
wizard -- first you get the files, then you rewrite URLs, then you strip 
down content.
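
The URL-rewriting step on its own looks roughly like the following (a 
rough sketch using lxml, with a made-up function name -- not the code 
I committed):

import os
import urllib
import urlparse

import lxml.html

def localize(html, base_url, asset_dir='static'):
    """Download the static assets a page points at and rewrite its
    links to the local copies.  Purely illustrative."""
    doc = lxml.html.fromstring(html)
    doc.make_links_absolute(base_url)
    if not os.path.isdir(asset_dir):
        os.makedirs(asset_dir)

    def replace(url):
        # Only fetch things that look like assets; leave ordinary
        # page links alone.
        if not url.lower().endswith(('.png', '.gif', '.jpg', '.css', '.js')):
            return url
        filename = os.path.basename(urlparse.urlsplit(url).path) or 'index'
        local_path = os.path.join(asset_dir, filename)
        if not os.path.exists(local_path):
            urllib.urlretrieve(url, local_path)
        return local_path

    doc.rewrite_links(replace)
    return lxml.html.tostring(doc)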

> Eventually I would like to add a button to the web2py admin interface  
> that says: "make my app look like that one" and it will scrape the  
> other one, download images and build a new web2py template. I am  
> close but I could use some help.

Deliverance (http://openplans.org/projects/deliverance) can basically do 
this -- you point it at the site you want to clone, and tell it what 
area to put the content of the original request into.
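
The core move can be sketched without Deliverance, too -- this is just 
an illustration with lxml, not Deliverance's actual rules format, and 
the CSS selector here is a guess:

import urllib2

import lxml.html

def reskin(theme_url, content_html, content_selector='div#content'):
    """Fetch an external page, keep its look, and replace its main
    content area with our own markup (one element, e.g. a <div>)."""
    theme = lxml.html.fromstring(urllib2.urlopen(theme_url).read())
    # Keep the theme's stylesheets and images working from our site.
    theme.make_links_absolute(theme_url)

    slots = theme.cssselect(content_selector)
    if not slots:
        raise ValueError('no element matches %r' % content_selector)
    slot = slots[0]

    # Empty the themed content area and drop our markup in its place.
    for child in list(slot):
        slot.remove(child)
    slot.text = None
    slot.append(lxml.html.fragment_fromstring(content_html))

    return lxml.html.tostring(theme)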

   Ian

