getting tables out

Tom Bryan tbryan at zarlut.utexas.edu
Mon May 24 12:44:24 EDT 1999


"Michael P. Reilly" wrote:
> 
> Michael Spalinski <mspal at sangria.harvard.edu> wrote:
> 
> : I would like to write a Python script that would read an HTML document and
> : extract table contents from it. Eg. each table could be a list of tuples
> : with data from the rows. I thought htmllib would provide the basic tools
> : for this, but I can't find any example that would be of use.
> 
> : So - does anyone have a Python snippet that looks for tables and gets at
> : the data?
> 
> It shouldn't be too hard to make a subclass of the htmllib.HTMLParser
> class that scans for TABLE, TR and TD (and maybe TH) tags.

Depending on what he wants to do, this may or may not be a good idea.  
I've found HTMLParser to be a rather slow solution for parsing many 
files for a small subset of tags.  It does a lot of extra work to 
process all of the tags.  (If I remember correctly, it calls a method 
at every tag, even if that method doesn't really do anything.)  
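
For reference, here is a rough sketch of what the subclassing approach 
might look like.  It is written against html.parser.HTMLParser from 
current Python 3 rather than the htmllib module discussed in this 
thread (that substitution is an assumption on my part).  The class name 
TableExtractor and its attributes are made up for illustration.  Note 
that the parser still dispatches a handler call for every tag in the 
document, which is the overhead described above.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each table as a list of row tuples (hypothetical helper).

    Limitation: nested tables are not handled separately; their rows
    end up in the most recently opened table.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of row tuples per <table>
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def _close_cell(self):
        # Finish the current cell, if any, and add its text to the row.
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        # Finish the current row, if any, and add it to the last table.
        if self._row is not None:
            self._close_cell()
            self.tables[-1].append(tuple(self._row))
            self._row = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        # Keep text only while inside a table cell.
        if self._cell is not None:
            self._cell.append(data)
```

To use it, create a TableExtractor, pass the document text to its 
feed() method, and read the result from the tables attribute.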

If he just wants to process an HTML file now and then, it probably 
doesn't matter.  If he's extracting all of the tables from hundreds 
of HTML documents on a site, he probably will notice the speed problem.
In the second case, I'd probably just write something that looks 
for the first TABLE tag in the file and grabs everything up to the 
first /TABLE tag.  The re module would do this nicely.  Splitting the 
rows and columns out of the table might be a pain, but I *imagine* that 
it would still be faster than an HTMLParser solution.  Of course, 
imaginations are tricky things. :)
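
A minimal sketch of the re-based idea, in current Python 3 syntax.  
The function name first_table and the pattern names are made up for 
illustration.  It assumes that every TABLE, TR, TD and TH element has 
an explicit closing tag and that tables are not nested; real-world 
HTML often breaks both assumptions.  I have not timed it against the 
parser approach.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Non-greedy: from the first <table ...> up to the first </table>.
TABLE_RE = re.compile(r"<table\b[^>]*>(.*?)</table\s*>", FLAGS)
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr\s*>", FLAGS)
CELL_RE = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]\s*>", FLAGS)
# Any remaining markup inside a cell (e.g. <b>, <a href=...>).
TAG_RE = re.compile(r"<[^>]+>")


def first_table(html):
    """Return the first table as a list of tuples, or None if absent."""
    match = TABLE_RE.search(html)
    if match is None:
        return None
    rows = []
    for row_html in ROW_RE.findall(match.group(1)):
        # Strip leftover tags and surrounding whitespace from each cell.
        cells = [TAG_RE.sub("", cell).strip()
                 for cell in CELL_RE.findall(row_html)]
        rows.append(tuple(cells))
    return rows
```

Calling first_table() on the text of a document gives one tuple per 
row, which matches the "list of tuples" layout asked for in the 
original question.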

I've thought about subclassing HTMLParser so that it could be used 
to process just a few tags (like table-related tags) quickly.  Has
anyone else done such a thing?

-- 
tbryan at zarlut.utexas.edu
Remove the z from this address to reply.
Stop spam!  http://spam.abuse.net/spam/