[Chicago] web page content scraper

Thu Apr 10 00:18:58 CEST 2008

I posted my toy screen scraper on

    http://mdp.cti.depaul.edu/examples/static/scraper.py

There seems to be a lot of expertise on the list on this topic, so
perhaps you can help make it better or just use it to make yours
better. Currently, given two example pages, it scrapes Wikipedia pages
correctly: it extracts all repeated tags, replaces all text with
"text", finds all links, and handles JavaScript properly.

It applies the LCS (longest common subsequence) algorithm to symbols
(tags) instead of characters, something I suggested when Adrian gave
us an excellent presentation on this topic.
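
To illustrate the idea of running the LCS over tags rather than
characters, here is a minimal sketch. It is not the code in scraper.py:
the names tag_symbols and lcs, the regular expression, and the two sample
pages are all assumptions made up for this example, and it is written for
Python 3.

```python
import re

# Hypothetical tokenizer: every "<...>" run counts as one symbol.
TAG_RE = re.compile(r"<[^>]+>")


def tag_symbols(html):
    """Return the sequence of tags in html, ignoring the text between them."""
    return TAG_RE.findall(html)


def lcs(a, b):
    """Longest common subsequence of two sequences (dynamic programming)."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one common subsequence.
    out = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


# Two made-up pages that share a layout but differ in content.
page_a = "<html><body><h1>Chicago</h1><p>A city.</p></body></html>"
page_b = ("<html><body><h1>Python</h1><ul><li>x</li></ul>"
          "<p>A language.</p></body></html>")

skeleton = lcs(tag_symbols(page_a), tag_symbols(page_b))
```

Here skeleton keeps only the tags the two pages have in common, in order,
which is the shared layout. The table is quadratic in the number of tags,
which is usually far smaller than the number of characters.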

What's missing:
1) After it finds all links, which it does (except for links in CSS
and JavaScript), it should loop over them, download the images, rename
them, and rename the links.
2) It may introduce some spurious close tags. They have to be removed.
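
The renaming part of step 1 could be sketched roughly as follows. This is
only an assumption about how it might be done, not code from scraper.py:
the function localize_images, the "static/" prefix, the image0, image1, ...
naming scheme, and the sample page are all hypothetical, and it only looks
at src attributes written with double quotes.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Hypothetical pattern: the src attribute of an <img> tag, double-quoted.
IMG_SRC_RE = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def localize_images(html, prefix="static/"):
    """Rewrite each <img src="..."> to a local name; return (html, mapping)."""
    mapping = {}  # original URL -> new local file name

    def rename(match):
        url = match.group(2)
        if url not in mapping:
            # Keep the original extension, number the files in order found.
            ext = posixpath.splitext(urlsplit(url).path)[1]
            mapping[url] = "%simage%d%s" % (prefix, len(mapping), ext)
        return match.group(1) + mapping[url] + match.group(3)

    return IMG_SRC_RE.sub(rename, html), mapping


# Made-up page with one absolute and one relative image link.
page = ('<p><img src="http://example.com/a/logo.png">'
        '<img alt="x" src="/b.gif"></p>')
new_page, mapping = localize_images(page)
```

The returned mapping tells a later download loop which URL to save under
which local name; relative URLs would first need to be resolved against the
page address, for example with urllib.parse.urljoin.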

Eventually I would like to add a button to the web2py admin interface
that says "make my app look like that one" and that scrapes the other
app, downloads its images, and builds a new web2py template. I am
close, but I could use some help.

Massimo

On Apr 9, 2008, at 2:23 PM, Christopher Allan Webber wrote:

> It sounds interesting.  I'm interested in seeing the technical reasons
> for the change to lxml, and possibly how that benefitted you.  Maybe
> do another talk (or at least a lightning talk) at another ChiPy
> meeting once you're ready to open it?
>
> "Adrian Holovaty" <web at holovaty.com> writes:
>
>> On Tue, Apr 8, 2008 at 9:25 AM, Tom Printy  
>> <tprinty at mail.edisonave.net> wrote:
>>> Wow this library is super cool. Anyone got slides or notes from the
>>>  talk?
>>
>> Hey, that's my library and was my talk. Note that the current version
>> of templatemaker (on Google Code) is pretty "dumb" when dealing with
>> HTML.
>>
>> Since that talk, I've developed a new one, based on lxml, that
>> analyzes differences in the HTML trees. It's a *lot* better (I'd even
>> call it *awesome*), but I haven't released it open-source yet. Stay
>> tuned.
>>
>> Adrian
>> _______________________________________________
>> Chicago mailing list
>> Chicago at python.org
>> http://mail.python.org/mailman/listinfo/chicago