Rookie: Parsing the nth table out of an html doc

Alex Martelli aleax at aleax.it
Thu Mar 20 02:44:03 EST 2003


footnipple wrote:

> Hello All,
> 
> I've "googled" and "rtfm'ed", and I'm just not getting how to do this
> from the python documentation...
> 
> I would simply like to put the contents of the nth table of an html
> doc into a list of tuples for further insertion into a db.
> 
> What module -> method(s) should I focus on, and is anyone familiar
> with any sample code that does something like this explicitly?
> 
> I guess I just need a few sentences to shove me in the right
> direction. Thanks in advance.

You can parse HTML by writing your own subclass of either of two
classes in the Standard Python Library: htmllib.HTMLParser, or
HTMLParser.HTMLParser.  You call the .feed method on an instance
of your class, passing successive pieces of the HTML you're
parsing, and finally call .close on that same instance.
You get callbacks to methods of your instance as various things
get parsed.  For an HTMLParser.HTMLParser subclass, that's basically
handle_starttag and handle_endtag as opening and closing tags are
met, and handle_data for data; htmllib.HTMLParser lets you
define specific per-tag methods as an alternative.
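
Here is a rough sketch of the handle_starttag / handle_endtag /
handle_data approach. It is not the code from the post linked below.
It assumes a current Python 3, where the HTMLParser module lives at
html.parser (and htmllib is gone). The names NthTableParser and
nth_table are made up for illustration, tables are counted from 1
in document order, and nested tables are assumed not to occur.

```python
from html.parser import HTMLParser  # Python 3 home of HTMLParser.HTMLParser


class NthTableParser(HTMLParser):
    """Collect the rows of the nth <table> (1-based) as tuples of cell text."""

    def __init__(self, n):
        super().__init__()
        self.n = n            # which table to keep, counting from 1
        self.seen = 0         # number of <table> start tags met so far
        self.active = False   # True while inside the wanted table
        self.rows = []        # finished rows, one tuple per <tr>
        self.row = None       # cells of the row being built, or None
        self.cell = None      # text pieces of the cell being built, or None

    def _end_cell(self):
        # Close the current cell, if any, and add its text to the row.
        if self.cell is not None and self.row is not None:
            self.row.append("".join(self.cell).strip())
        self.cell = None

    def _end_row(self):
        # Close the current row, if any, and store it as a tuple.
        self._end_cell()
        if self.row is not None:
            self.rows.append(tuple(self.row))
        self.row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.active:
                self._end_row()   # nested tables are not supported here
            self.seen += 1
            self.active = (self.seen == self.n)
        elif self.active and tag == "tr":
            self._end_row()       # tolerate a missing </tr>
            self.row = []
        elif self.active and tag in ("td", "th") and self.row is not None:
            self._end_cell()      # tolerate a missing </td> or </th>
            self.cell = []

    def handle_endtag(self, tag):
        if not self.active:
            return
        if tag in ("td", "th"):
            self._end_cell()
        elif tag == "tr":
            self._end_row()
        elif tag == "table":
            self._end_row()
            self.active = False

    def handle_data(self, data):
        # Text only matters while a cell of the wanted table is open.
        if self.cell is not None:
            self.cell.append(data)


def nth_table(html, n):
    """Return the nth table of an HTML string as a list of tuples."""
    parser = NthTableParser(n)
    parser.feed(html)   # may be called repeatedly with successive pieces
    parser.close()
    return parser.rows
```

Calling nth_table(html_text, 2) would then give one tuple of cell
strings per row of the second table, ready to hand to a db module's
executemany or similar.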

A more specific example is the one I posted about 3 years ago
(very long URL, can't make it tiny as tinyurl.com ain't working):

http://groups.google.com/groups?q=html+table+htmlparser+group:comp.lang.python+author:alex+author:martelli&hl=en&lr=lang_en&ie=UTF-8&oe=utf-8&selm=8o2lok0160q%40news2.newsguy.com&rnum=1


Alex




