[Chicago] web page content scraper

Ian Bicking ianb at colorstudy.com
Thu Apr 10 01:02:36 CEST 2008


Massimo Di Pierro wrote:
> I posted my toy screen scraper on
> 
>     http://mdp.cti.depaul.edu/examples/static/scraper.py

This link doesn't work for me.

> There seems to be a lot of expertise on the list on this topic, so  
> perhaps you can help make it better or just use it to make yours  
> better. Currently it correctly scrapes a Wikipedia page given two  
> examples, extracts all repeated tags, removes all text and replaces  
> it with "text", finds all links, and handles javascript properly.
> 
> It uses the LCS applied to symbols (tags) instead of characters,  
> something I suggested when Adrian gave us an excellent presentation  
> on this topic.
> 
> What's missing:
> 1) after it finds all links, which it does (except for links in css  
> and javascript), it should loop over them, download the images,  
> rename them, and rewrite the links.
> 2) It may introduce some spurious close tags. They have to be removed.
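For anyone following along, the LCS-over-tags idea reduces each example 
page to its sequence of tag tokens and diffs those sequences instead of 
the raw text.  A minimal sketch of that step (my own illustration, not 
Massimo's code) might look like this:

import re

def tag_tokens(html):
    """Reduce a page to its sequence of tag tokens; the leading slash
    is kept so open and close tags stay distinct."""
    return re.findall(r'</?[a-zA-Z][a-zA-Z0-9]*', html)

def lcs(a, b):
    """Classic dynamic-programming longest common subsequence."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk the table backwards to recover the shared tag sequence.
    out, i, j = [], len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

page_one = '<html><body><h1>A</h1><p>x</p><p>y</p></body></html>'
page_two = '<html><body><h1>B</h1><p>z</p></body></html>'
# Prints the tag skeleton the two example pages have in common, which
# is what gets treated as the template.
print lcs(tag_tokens(page_one), tag_tokens(page_two))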

The project I worked on during Code-for-a-Cause is probably related to 
this: http://code.google.com/p/scrapy/

I was trying to finish it up when App Engine got me distracted.  I just 
committed the incomplete web interface, which is meant to be a kind of 
wizard -- first you get the files, then you rewrite URLs, then you strip 
down content.
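
The URL-rewriting step on its own looks roughly like the following (a 
rough sketch using lxml, with a made-up function name -- not the code 
I committed):

import os
import urllib
import urlparse

import lxml.html

def localize(html, base_url, asset_dir='static'):
    """Download the static assets a page points at and rewrite its
    links to the local copies.  Purely illustrative."""
    doc = lxml.html.fromstring(html)
    doc.make_links_absolute(base_url)
    if not os.path.isdir(asset_dir):
        os.makedirs(asset_dir)

    def replace(url):
        # Only fetch things that look like assets; leave ordinary
        # page links alone.
        if not url.lower().endswith(('.png', '.gif', '.jpg', '.css', '.js')):
            return url
        filename = os.path.basename(urlparse.urlsplit(url).path) or 'index'
        local_path = os.path.join(asset_dir, filename)
        if not os.path.exists(local_path):
            urllib.urlretrieve(url, local_path)
        return local_path

    doc.rewrite_links(replace)
    return lxml.html.tostring(doc)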

> Eventually I would like to add a button to the web2py admin interface  
> that says: "make my app look like that one" and it will scrape the  
> other one, download images and build a new web2py template. I am  
> close but I could use some help.

Deliverance (http://openplans.org/projects/deliverance) can basically do 
this -- you point it at the site you want to clone, and tell it what 
area to put the content of the original request into.
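
The core move can be sketched without Deliverance, too -- this is just 
an illustration with lxml, not Deliverance's actual rules format, and 
the CSS selector here is a guess:

import urllib2

import lxml.html

def reskin(theme_url, content_html, content_selector='div#content'):
    """Fetch an external page, keep its look, and replace its main
    content area with our own markup (one element, e.g. a <div>)."""
    theme = lxml.html.fromstring(urllib2.urlopen(theme_url).read())
    # Keep the theme's stylesheets and images working from our site.
    theme.make_links_absolute(theme_url)

    slots = theme.cssselect(content_selector)
    if not slots:
        raise ValueError('no element matches %r' % content_selector)
    slot = slots[0]

    # Empty the themed content area and drop our markup in its place.
    for child in list(slot):
        slot.remove(child)
    slot.text = None
    slot.append(lxml.html.fragment_fromstring(content_html))

    return lxml.html.tostring(theme)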

   Ian

