converting an html table to a tree

Thu Aug 24 04:13:45 EDT 2000

[posted AND mailed]

"Ian Lipsky" <NOSPAM at pacificnet.net> wrote in message
news:to2p5.444$3Q6.18123 at newsread2.prod.itd.earthlink.net...
> hi all,
>
> I'm completely new to python...just started reading learning python. I've
> got 5 days to figure out how to write a script to take an html table and
> convert it to a tree....basically a nested array (that's my guess on how it
> would be done anyhow). Oh yeah...and I also have to drive 3000 miles in
> those 5 days ;)p

Despite Python's ease, coding Python while actually driving is
a practice to be discouraged.  Your Python code will probably
come out all right, but your car might crash in the meantime.

> I was hoping someone could give me a push in the right direction. What
> functions or whatever should I look at to get this done? I saw that in the
> book it mentions pythons ability to grab html and parse it, as one if the
> pluses of the language ('internet utility modules' is what the book called
> it/them).

Standard modules htmllib and sgmllib will indeed help you.

> And I just found I don't have to actually go out and grab the page off a
> webserver. It'll be a file residing on the machine where the script will be
> run. I assume that'll make it a little easier for me :)

Not by much, since getting data off an arbitrary URL is so easy with
Python, but, yes, you can reduce your 'main program' to two lines:

    parser.feed(open('myfile.html').read())
    parser.close()

once you have properly instantiated the 'parser' instance you need.
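
For readers on a current Python 3, where htmllib no longer exists, here is
a rough sketch of the same feed/close pattern using html.parser.HTMLParser
from the standard library instead.  The TagLister class, the list_tags
function and the temporary file are made up for this illustration only.

```python
import os
import tempfile
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Hypothetical parser that only records the start tags it sees."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def list_tags(filename):
    parser = TagLister()
    with open(filename) as f:
        parser.feed(f.read())   # hand the whole file to the parser
    parser.close()              # flush anything still buffered
    return parser.tags

# Write a small stand-in file so the sketch runs on its own.
fd, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'w') as f:
    f.write('<table><tr><td>x</td></tr></table>')
tags = list_tags(path)
os.remove(path)
print(tags)   # ['table', 'tr', 'td']
```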

Basically, you want to derive your class from htmllib.HTMLParser, and
add methods to handle the tags you're specifically interested in -- for
the problem you stated, table-related tags.

For any tag name FOO, you need to define in your class either one method:
    def do_foo(self, attributes):
        # do whatever
if the tag does not require a corresponding close-tag (e.g., <br>); or,
more commonly, two methods:
    def start_foo(self, attributes):
        # opening stuff
    def end_foo(self):
        # closing stuff
if both an opening and a closing tag will be there (<table> ... </table>,
and similar cases).

The 'attributes' argument is a (possibly empty) list of (name,value) pairs.
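
To make the naming convention concrete on a current Python 3 (where htmllib
is gone), the sketch below emulates it on top of html.parser.HTMLParser,
which itself only offers handle_starttag and handle_endtag.  The
DispatchingParser and Demo classes are hypothetical helpers written for
this example, not part of any library.

```python
from html.parser import HTMLParser

class DispatchingParser(HTMLParser):
    """Hypothetical base class routing tags to start_foo/end_foo/do_foo."""
    def handle_starttag(self, tag, attrs):
        # attrs is a (possibly empty) list of (name, value) pairs;
        # html.parser lowercases tag and attribute names for us
        method = (getattr(self, 'start_' + tag, None)
                  or getattr(self, 'do_' + tag, None))
        if method is not None:
            method(attrs)
    def handle_endtag(self, tag):
        method = getattr(self, 'end_' + tag, None)
        if method is not None:
            method()

class Demo(DispatchingParser):
    def __init__(self):
        DispatchingParser.__init__(self)
        self.events = []
    def start_table(self, attributes):
        self.events.append(('start_table', dict(attributes)))
    def end_table(self):
        self.events.append(('end_table', None))
    def do_br(self, attributes):          # no closing tag expected
        self.events.append(('do_br', dict(attributes)))

demo = Demo()
demo.feed('<TABLE BORDER=1 WIDTH="80%"><br></TABLE>')
demo.close()
for event in demo.events:
    print(event)
```

Turning the pairs into a dict, as done here, is handy when you only need
to look up one attribute by name.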

Further, you'll want to define a method in your class:
    def handle_data(self, data):
        # whatever
that will receive all textual data.  Of course, you'll have flags
you maintain on start/end methods telling you whether the data must
simply be discarded, or how it is to be processed if relevant.
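
A minimal sketch of that flag idea, again assuming Python 3's
html.parser.HTMLParser rather than htmllib: text is kept only while the
parser is inside a TD.  The CellText class and its in_td flag are
illustrative names, not anything the library defines.

```python
from html.parser import HTMLParser

class CellText(HTMLParser):
    """Hypothetical parser keeping only the text found inside TD tags."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_td = 0      # flag maintained by the start/end handlers
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = 1
    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = 0
    def handle_data(self, data):
        if self.in_td:      # everything else is simply discarded
            self.chunks.append(data)

p = CellText()
p.feed('<p>ignored</p><table><tr><th>head</th><td>kept</td></tr></table>')
p.close()
cell_text = ''.join(p.chunks)
print(cell_text)   # kept
```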

Now to your specific case.  The tags you may want to handle are:

TABLE, CAPTION, COL, COLGROUP, TBODY, TD, TFOOT, TH, THEAD, TR.

(have I missed any...?).  In HTML 4, COL is the only one that is
always empty (its closing tag is forbidden); TABLE and CAPTION
require a closing tag, while for the others (COLGROUP, THEAD,
TFOOT, TBODY, TR, TH, TD) the closing tag is optional, so pages
in the wild may omit it.  Which of these tags carry significant
information for your purposes...?

The general structure might be:
    TABLE
        CAPTION
        THEAD
        TBODY
        TFOOT
CAPTION is optional.  So is each of THEAD, TBODY, TFOOT: if none
is explicitly specified, TBODY is implied.  Each of THEAD, TBODY,
TFOOT has contents:
    THEAD|TBODY|TFOOT:
        TR
            TH
            TD
Zero or more TH and TD within each TR, zero or more TR's.

I'm skipping COL, COLGROUP, and the attributes, as I think they
are basically presentational only, and you seem interested in
content-structuring instead.

Now, we need more precise specs: what kinds of tables do you
need to parse, and how do you want to structure (and output?)
the data they contain, depending on caption/thead/tbody/tfoot
and tr/th/td issues...?

Let's take a very simple case to make things more definite.

We process TABLE elements where only TBODY is interesting --
THEAD and TFOOT, we skip silently.  Similarly, we skip TH
and its contents too: we're only interested in:
    TABLE
        TBODY (may be implied)
            TR (zero or more)
                TD (zero or more)
                    data contents of TD tags only
As a result, we return a list (Python's normal data structure
for sequences; 'array' is very specialized in Python) where
each element corresponds to one row (TR); each element in
the list is another, nested, list, where each element
corresponds to the data in a TD, in sequence.
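
As a small illustration of that result shape (the values are simply the
ones from the sample table further down, typed in by hand), indexing the
outer list picks a row and indexing the inner list picks a cell:

```python
# One inner list per TR, one string per TD, in document order.
rows = [
    ['Row 1, Column 1 text.', 'Row 1, Column 2 text.'],
    ['Row 2, Column 1 text.', 'Row 2, Column 2 text.'],
]

first_row = rows[0]          # the whole first row
cell = rows[1][0]            # second row, first column
widths = [len(row) for row in rows]

print(cell)                  # Row 2, Column 1 text.
print(widths)                # [2, 2]
```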

Our class will expect to be 'fed' a document fragment
containing exactly one TABLE (the TABLE tag will have
to be explicit), and will ignore anything outside of
that tag as well as any redundant or nested TABLE tags
that may also be present.  This is basically for simplicity;
you will have to think deeply about what you want to do
in each of these cases!  And add good error diagnosis...

We'll also basically assume decent nesting rather than
go out of our way to accept peculiarly structured tables;
this, too, will need in-depth review!

import htmllib
import formatter
import string
import pprint

class TableParser(htmllib.HTMLParser):
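    def __init__(self):
        self.active=0        # 1 while inside the (first) TABLE
        self.finished=0      # 1 once that TABLE has been closed
        self.skipping=0      # nonzero while inside parts we ignore
        self.result=[]       # list of rows, each a list of strings
        self.current_row=[]  # cells of the TR being processed
        self.current_data=[] # text chunks of the TD being processed
        htmllib.HTMLParser.__init__(
            self, formatter.NullFormatter())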
    def start_table(self,attributes):
        if not self.finished:
            self.active=1
    def end_table(self):
        self.active=0
        self.finished=1
    def start_tbody(self,attributes):
        self.skipping=0
    def end_tbody(self):
        self.skipping=1
    def start_thead(self,attributes):
        self.skipping=1
    def end_thead(self):
        self.skipping=0
    def start_tfoot(self,attributes):
        self.skipping=1
    def end_tfoot(self):
        self.skipping=0
    def start_caption(self,attributes):
        self.skipping=1
    def end_caption(self):
        self.skipping=0
    def start_th(self,attributes):
        # a counter here, since TH may sit inside a skipped THEAD
        self.skipping=self.skipping+1
    def end_th(self):
        self.skipping=self.skipping-1
    def start_tr(self,attributes):
        if self.active and not self.skipping:
            self.current_row = []
    def end_tr(self):
        if self.active and not self.skipping:
            self.result.append(self.current_row)
    def start_td(self,attributes):
        if self.active and not self.skipping:
            self.current_data = []
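    def end_td(self):
        if self.active and not self.skipping:
            # join with '' since one cell's text may arrive in
            # several chunks; the default separator is a space
            self.current_row.append(
                string.join(self.current_data, ''))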
    def handle_data(self, data):
        if self.active and not self.skipping:
            self.current_data.append(data)

def process(filename):
    parser=TableParser()
    parser.feed(open(filename).read())
    parser.close()
    return parser.result

def showparse(filename):
    pprint.pprint(process(filename))

def _test():
    return showparse('c:/atable.htm')

if __name__=='__main__':
    _test()

With c:/atable.htm contents being, for example:

<TABLE BORDER=1 WIDTH=80%>
<THEAD>
<TR>
<TH>Heading 1</TH>
<TH>Heading 2</TH>
</TR>
</THEAD>
<TBODY>
<TR>
<TD>Row 1, Column 1 text.</TD>
<TD>Row 1, Column 2 text.</TD>
</TR>
<TR>
<TD>Row 2, Column 1 text.</TD>
<TD>Row 2, Column 2 text.</TD>
</TR>
</TBODY>
</TABLE>

running the _test function will emit:

>>> tableparse._test()
[['Row 1, Column 1 text.', 'Row 1, Column 2 text.'],
 ['Row 2, Column 1 text.', 'Row 2, Column 2 text.']]
>>>
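
For anyone trying this on Python 3, where htmllib and the string.join
function are no longer available, here is a hedged sketch of a rough
counterpart built on html.parser.HTMLParser.  It is not a drop-in
translation: TableParser3, parse_table, SKIPPED and SAMPLE are names
invented for this example, TBODY tags are simply ignored, and every
skipped element uses one shared counter.

```python
from html.parser import HTMLParser

# Elements whose contents are ignored, as in the simple case above.
SKIPPED = ('thead', 'tfoot', 'caption', 'th')

class TableParser3(HTMLParser):
    """Hypothetical Python 3 counterpart of the TableParser above."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.active = False      # inside the first TABLE?
        self.finished = False    # first TABLE already closed?
        self.skipping = 0        # depth of skipped elements
        self.result = []
        self.current_row = []
        self.current_data = []

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            if not self.finished:
                self.active = True
        elif tag in SKIPPED:
            self.skipping += 1
        elif self.active and not self.skipping:
            if tag == 'tr':
                self.current_row = []
            elif tag == 'td':
                self.current_data = []

    def handle_endtag(self, tag):
        if tag == 'table':
            self.active = False
            self.finished = True
        elif tag in SKIPPED:
            self.skipping = max(0, self.skipping - 1)
        elif self.active and not self.skipping:
            if tag == 'tr':
                self.result.append(self.current_row)
            elif tag == 'td':
                self.current_row.append(''.join(self.current_data))

    def handle_data(self, data):
        if self.active and not self.skipping:
            self.current_data.append(data)

def parse_table(text):
    parser = TableParser3()
    parser.feed(text)
    parser.close()
    return parser.result

SAMPLE = """<TABLE BORDER=1 WIDTH=80%>
<THEAD>
<TR>
<TH>Heading 1</TH>
<TH>Heading 2</TH>
</TR>
</THEAD>
<TBODY>
<TR>
<TD>Row 1, Column 1 text.</TD>
<TD>Row 1, Column 2 text.</TD>
</TR>
<TR>
<TD>Row 2, Column 1 text.</TD>
<TD>Row 2, Column 2 text.</TD>
</TR>
</TBODY>
</TABLE>"""

rows = parse_table(SAMPLE)
print(rows)
```

Like the original, this assumes explicit closing tags and decent nesting;
a nested TABLE would still end the parse early.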

I hope this gives you a somewhat usable start on
your problem.

Alex