getting tables out

Michael P. Reilly arcege at shore.net
Mon May 24 12:09:45 EDT 1999


Michael Spalinski <mspal at sangria.harvard.edu> wrote:

: I would like to write a Python script that would read an HTML document and
: extract table contents from it. Eg. each table could be a list of tuples
: with data from the rows. I thought htmllib would provide the basic tools
: for this, but I can't find any example that would be of use. 

: So - does anyone have a Python snippet that looks for tables and gets at
: the data?

It shouldn't be too hard to make a subclass of the htmllib.HTMLParser
class that scans for TABLE, TR and TD (and maybe TH) tags.

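  from htmllib import HTMLParser
  from formatter import NullFormatter

  class TableExtractor(HTMLParser):
    def __init__(self, formatter=None):
      # HTMLParser calls the formatter for other tags, so default to a no-op one
      if formatter is None:
        formatter = NullFormatter()
      HTMLParser.__init__(self, formatter)
      self.tablelist = []
      self.current_table = None
      self.table_stack = None  # for nested tables
    def start_table(self, attributes):
      if self.current_table is not None:
        # push the enclosing table onto a linked (table, rest) stack
        self.table_stack = self.current_table, self.table_stack
      self.current_table = []
    def end_table(self):
      self.tablelist.append(self.current_table)
      if self.table_stack:
        self.current_table, self.table_stack = self.table_stack
      else:
        self.current_table = None  # back outside any table
    def start_tr(self, attributes):
      self.current_table.append([])
    def end_tr(self):
      pass
    def start_td(self, attributes):
      self.current_table[-1].append([])
    def end_td(self):
      pass
    start_th, end_th = start_td, end_td  # treat header cells like data cells
    def handle_data(self, data):
      # only collect text that falls inside a cell
      if self.current_table and self.current_table[-1]:
        self.current_table[-1][-1].append(data)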

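The result is in self.tablelist (as a list of the tables, since you can have
more than one table in a document).  Each table is a list of rows, each row
is a list of cells, and each cell is a list of the text chunks found in it.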
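
If you are on a newer Python where htmllib is no longer available, the
same idea can be sketched with html.parser.HTMLParser.  This is only a
rough sketch under a few assumptions, not a drop-in replacement for the
class above: TableExtractor3 and tables_as_tuples are made-up names, it
keeps the open tables on a plain list instead of the nested-tuple stack,
and it joins each cell's text chunks so that every row comes out as a
tuple of strings (the "list of tuples" shape asked for).

```python
from html.parser import HTMLParser  # Python 3 stand-in for htmllib (assumption)


class TableExtractor3(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.tablelist = []  # finished tables, in the order they are closed
        self.stack = []      # tables that are currently open (handles nesting)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.stack.append([])            # start a new, empty table
        elif not self.stack:
            return                           # ignore tags outside any table
        elif tag == "tr":
            self.stack[-1].append([])        # start a new row
        elif tag in ("td", "th"):
            if not self.stack[-1]:
                self.stack[-1].append([])    # tolerate a cell with no <tr>
            self.stack[-1][-1].append([])    # start a new cell

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            self.tablelist.append(self.stack.pop())

    def handle_data(self, data):
        # keep text only when a table, a row and a cell are all open
        if self.stack and self.stack[-1] and self.stack[-1][-1]:
            self.stack[-1][-1][-1].append(data)


def tables_as_tuples(html):
    """Return each table as a list of tuples, one tuple of strings per row."""
    parser = TableExtractor3()
    parser.feed(html)
    parser.close()
    return [[tuple("".join(cell).strip() for cell in row) for row in table]
            for table in parser.tablelist]


sample = ("<table><tr><th>name</th><th>qty</th></tr>"
          "<tr><td>spam</td><td>2</td></tr></table>")
result = tables_as_tuples(sample)
print(result)  # [[('name', 'qty'), ('spam', '2')]]
```

Note that a nested table is appended when its closing tag is seen, so it
shows up in the result before the table that contains it.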

I haven't really tested this, so it might need a little more work, 
but I think you get the idea.  You need to read the module docs for
sgmllib and htmllib (http://www.python.org/doc/current/lib/).

  -Arcege




