converting an html table to a tree

Alex Martelli alex at magenta.com
Fri Aug 25 08:35:45 EDT 2000


"Sami Hangaslammi" <sami.hangaslammi.spam.trap at yomimedia.fi> wrote in
message news:8o5eo0$aep7$1 at learnet.freenet.hut.fi...
>
> "Ian Lipsky" <NOSPAM at pacificnet.net> wrote in message
> news:3ufp5.440$bw2.8538 at newsread2.prod.itd.earthlink.net...
>
> > I know i saw a bit of code dealing with doing something like that...i
> > think it was using regexp? i'll have to dig it up.
>
> A very simple solution using regexp (ignoring all tags except table,tr and
> td) for creating a list of all tables in a document:
>
> |import re
> |
> |def rex_tag(tag):
> |    return re.compile("(?msi)<%s.*?>(.*?)</%s.*?>" % (tag,tag))

Too simple.  At the very least, place a \b after the %s's, else you
might match some extended tags such as <translate> as if they were
<tr> (and you could get such tags, for example, in an XML data
island, etc, etc).  Also, whitespace is OK (you must match
    < table>
just as you would
    <table>
etc, etc).
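
A minimal sketch of those two fixes applied to the quoted rex_tag,
assuming nothing else about the pattern changes. The sample string is
made up for illustration, and this still does not make regular
expressions a safe way to parse HTML:

```python
import re

def rex_tag(tag):
    # \s* tolerates whitespace after "<" and "</";
    # \b keeps "tr" from matching the start of "<translate>".
    pattern = r"(?msi)<\s*%s\b.*?>(.*?)</\s*%s\b.*?>" % (tag, tag)
    return re.compile(pattern)

# Hypothetical input: an extended tag followed by a row with a space after "<".
sample = "<translate>x</translate>< tr><td>cell</td></tr>"
rows = rex_tag("tr").findall(sample)
print(rows)
```

With the original pattern, the <translate> element would have been
reported as a row; here only the real row is returned.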

But, it's much worse than this; just wait until you hit some
<script> tag that houses JScript (or Python:-) code such as
    if(i<tr) {
[or
    if i<tr:
of course].  (Yes, you can have this even in the most compliant
of XHTML documents -- housed in a CDATA section, of course).
I think you can also hit 'accidental' <-signs in URNs (properly
quoted, of course).  Ah, did I mention
    <!-- comments? -->
or didn't I...?
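
To make the <script> case concrete, here is a small sketch. The name
tr_rex and the sample page are made up for illustration, and the
pattern already includes the \b and whitespace fixes:

```python
import re

# Hypothetical row matcher with the \b and whitespace fixes applied.
tr_rex = re.compile(r"(?msi)<\s*tr\b.*?>(.*?)</\s*tr\b.*?>")

# Made-up page: script code containing "i<tr", then one real table row.
page = """<script>
if (i<tr) { i = 0; }
</script>
<table><tr><td>real cell</td></tr></table>"""

rows = tr_rex.findall(page)
print(rows)
```

The match starts at the "<tr" inside the script, so the single "row"
that comes back swallows the <table> and <tr> tags instead of holding
just the cell.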


There's no end to the amount of such trouble you can get into,
trying to parse HTML (or XML) documents by regular expressions.
I _strongly_ urge anybody having to parse HTML (or XML) to
rely on suitable parsers rather than trying to roll their own.

Python's htmllib and sgmllib may not be perfect, but they're
much better than nothing, and I'm positive they will reduce
your stress-level (and number of obscure never-tested bugs
waiting to happen) compared with the roll-your-own approach.
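
For readers on Python 3, where htmllib and sgmllib are no longer in the
standard library, the same advice can be sketched with
html.parser.HTMLParser. The class name TableCollector and the sample
page are made up for illustration, and nested tables are deliberately
not handled:

```python
from html.parser import HTMLParser

class TableCollector(HTMLParser):
    """Collect each table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # text fragments while inside a <td>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag == "td" and self.tables and self.tables[-1]:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        # Text outside a <td> (including script code) is ignored.
        if self._cell is not None:
            self._cell.append(data)

# Made-up page with the traps mentioned above: script code and a comment.
page = """<script>if (i<tr) { i = 0; }</script>
<!-- <tr><td>commented out</td></tr> -->
<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>"""

collector = TableCollector()
collector.feed(page)
collector.close()
print(collector.tables)
```

The parser treats the script body as plain data and the comment as a
comment, so neither one produces a spurious row.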


Alex





