getting tables out
Michael P. Reilly
arcege at shore.net
Mon May 24 12:09:45 EDT 1999
Michael Spalinski <mspal at sangria.harvard.edu> wrote:
: I would like to write a Python script that would read an HTML document and
: extract table contents from it. Eg. each table could be a list of tuples
: with data from the rows. I thought htmllib would provide the basic tools
: for this, but I can't find any example that would be of use.
: So - does anyone have a Python snippet that looks for tables and gets at
: the data?
It shouldn't be too hard to make a subclass of the htmllib.HTMLParser
class that scans for TABLE, TR and TD (and maybe TH) tags.
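
from htmllib import HTMLParser
from formatter import NullFormatter

class TableExtractor(HTMLParser):
    def __init__(self, formatter=None):
        # HTMLParser calls the formatter for tags like <P>, so give it a
        # do-nothing formatter when the caller doesn't supply one
        if formatter is None:
            formatter = NullFormatter()
        HTMLParser.__init__(self, formatter)
        self.tablelist = []
        self.current_table = None
        self.table_stack = None # for nested tables
    def start_table(self, attributes):
        if self.current_table is not None:
            # push the enclosing table until the nested one is finished
            self.table_stack = self.current_table, self.table_stack
        self.current_table = []
    def end_table(self):
        self.tablelist.append(self.current_table)
        if self.table_stack:
            self.current_table, self.table_stack = self.table_stack
        else:
            self.current_table = None # no longer inside any table
    def start_tr(self, attributes):
        if self.current_table is not None:
            self.current_table.append([])
    def end_tr(self):
        pass
    def start_td(self, attributes):
        if self.current_table: # need at least one row to hold the cell
            self.current_table[-1].append([])
    def end_td(self):
        pass
    # treat header cells like data cells
    start_th, end_th = start_td, end_td
    def handle_data(self, data):
        # ignore text outside tables and text that comes before the first cell
        if self.current_table and self.current_table[-1]:
            self.current_table[-1][-1].append(data)
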
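The result is in self.tablelist (a list of tables, since you can have
more than one table in a document). Each table is a list of rows, each
row is a list of cells, and each cell is a list of the data strings found
in it. A nested table is appended before the table that encloses it.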
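
For anyone trying this on Python 3, where htmllib no longer exists, here is
a rough sketch of the same idea on top of html.parser.HTMLParser. The names
TableExtractor3 and as_tuples are made up for illustration, and the
whitespace handling is my own assumption rather than part of the class
above. It also shows how to get at the data once parsing is done.

```python
from html.parser import HTMLParser  # Python 3 stdlib replacement for htmllib


class TableExtractor3(HTMLParser):
    """Hypothetical Python 3 variant: tables -> rows -> cells -> data strings."""

    def __init__(self):
        super().__init__()
        self.tablelist = []  # finished tables (nested ones come first)
        self.stack = []      # tables that are still open, outermost first

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.stack.append([])
        elif not self.stack:
            return  # row or cell tag outside any table
        elif tag == "tr":
            self.stack[-1].append([])
        elif tag in ("td", "th") and self.stack[-1]:
            self.stack[-1][-1].append([])

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            self.tablelist.append(self.stack.pop())

    def handle_data(self, data):
        # Assumption: whitespace-only text between tags is not wanted.
        if not data.strip():
            return
        # Keep text only when the innermost open table has a row with a cell.
        if self.stack and self.stack[-1] and self.stack[-1][-1]:
            self.stack[-1][-1][-1].append(data)


def as_tuples(table):
    """Turn one extracted table into a list of tuples of cell strings."""
    return [tuple("".join(cell).strip() for cell in row) for row in table]


html = ("<table><tr><th>name</th><th>qty</th></tr>"
        "<tr><td>spam</td><td>3</td></tr></table>")
parser = TableExtractor3()
parser.feed(html)
parser.close()
print(as_tuples(parser.tablelist[0]))  # [('name', 'qty'), ('spam', '3')]
```

Like the htmllib version, this never looks at the closing TD tag, so text
that follows a cell but precedes the next one ends up in the earlier cell.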
I haven't really tested this, so it might need a little more work,
but I think you get the idea. You need to read the module docs for
sgmllib and htmllib (http://www.python.org/doc/current/lib/).
-Arcege