converting an html table to a tree

Alex Martelli alex at magenta.com
Thu Aug 24 18:17:40 EDT 2000


"Ian Lipsky" <NOSPAM at pacificnet.net> wrote in message
news:3ufp5.440$bw2.8538 at newsread2.prod.itd.earthlink.net...
    [snip]
> > A <TR> could contain <TH>.  What would you want to do with those?
    [snip]
> Hmm...true i forgot about that. Actually, it could have a whole load of
> tags inside the <TD> tags...font, bold etc. Since i'm only concerned with
> the data and not the formatting, i'll just have to make sure i put
> something in so that once its inside the <td> tags, it ignores the opening
> tag < and the closing tag > and everything between it, unless its </td>
>
> I know i saw a bit of code dealing with doing something like that...i
> think it was using regexp? i'll have to dig it up.

If you use sgmllib/htmllib, along the lines of my example, you won't
have to worry about that -- it's taken care of.  Forget trying to do it
with regular expressions: the HTML is already being parsed for you
(with a lot of care to parse its umpteen anomalies correctly), so why
re-do the work?
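
To make that concrete, here is a minimal sketch, not the example referred
to above.  It assumes Python 3, where sgmllib and htmllib no longer exist,
so it swaps in the standard html.parser.HTMLParser, which plays the same
role.  The names CellText and cell_texts are made up for illustration.
Tags nested inside a <TD> need no special handling: the parser reports
them as events, and the handler simply does nothing with them.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of each <td>, ignoring tags nested inside it.

    Hypothetical helper for illustration; HTMLParser stands in for the
    sgmllib/htmllib parsers, which are not available in Python 3.
    """

    def __init__(self):
        super().__init__()
        self.cells = []   # finished cell texts, in document order
        self._buf = None  # text pieces while inside a <td>, else None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; anything but <td> is ignored.
        if tag == "td":
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._buf is not None:
            self.cells.append("".join(self._buf).strip())
            self._buf = None

    def handle_data(self, data):
        # Keep character data only while a <td> is open.
        if self._buf is not None:
            self._buf.append(data)


def cell_texts(html):
    """Return the text content of every <td> in the given HTML string."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    return parser.cells
```

Feeding it a cell such as <TD><FONT SIZE="2"><B>Some</B> data</FONT></TD>
yields just the text, with no regular expressions involved.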

The issue I was commending to your attention is different:

<TABLE>
<THEAD>
    <TR> <TH>A header</TH> <TH>Another</TH> </TR>
</THEAD>
<TBODY>
    <TR> <TD>Some data</TD> <TD>Some more</TD> </TR>
</TBODY>
</TABLE>

What do you want to come out of this?  I suspect ignoring the
<THEAD> is probably closest to your needs, as is ignoring any
<TR> that contains no <TD>s (only <TH>s).
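
A rough sketch of that policy, under the same assumptions as any modern
rewrite: Python 3 and html.parser.HTMLParser in place of sgmllib/htmllib.
The names TableRows and table_rows are invented for illustration, and
nested tables are not handled.  Rows inside <THEAD> are skipped, and a
<TR> that ends without having seen a <TD> is dropped.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect data rows as lists of <td> texts (hypothetical helper).

    Skips everything inside <thead> and drops any <tr> with no <td>.
    """

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell texts per kept row
        self._row = None        # cells of the open <tr>, else None
        self._cell = None       # text pieces of the open <td>, else None
        self._in_thead = False

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif tag == "tr" and not self._in_thead:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        # <th> and formatting tags fall through and are ignored.

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:       # keep only rows that had at least one <td>
                self.rows.append(self._row)
            self._row = None
            self._cell = None   # discard any unclosed cell

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_rows(html):
    """Return the data rows of the table(s) in the given HTML string."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

On the table above this leaves a single row holding "Some data" and
"Some more"; whether dropping the headers is right depends on what the
tree is for.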


Alex





