Regular Expression Help for Python Newbie.

Raoul-Sam Daruwala raoul at ez-ways.com
Fri Apr 7 15:19:45 EDT 2000


I have a problem. I wrote a Python program to parse HTML files using the
HTMLParser and all that I need to do with the files can be done very
easily using this wonderful class. Kudos to the authors!

My problem is that one of the sets of files that I'm trying to parse has
badly formatted tables. When I do a view source on the files I can
see the problem clearly. It's quite simple: the tables in these files
start out properly formatted, but after a standard header the script
that generates them leaves out the <TR> tag. This is incredible to me
because both Netscape and IE can view the tables properly. But the
HTMLParser dies on them.

Let me be a little more explicit. The problem files, and there are over
300 of them, have the following structure to their tables.

<TABLE>
<TR>
<TD> foo</TD> <TD>bar</TD>
</TR>
<TD> 1</TD><TD>2</TD></TR>
<TD>3</TD><TD>4</TD></TR>
</TABLE>

Incredibly, this will display correctly (or rather incorrectly, as it
shouldn't display at all) in most browsers.

What I need to do to fix this is run a quick pre-processor that uses the
re module to replace all occurrences of

    </TR> junk </TR> with
    </TR><TR> junk </TR>
where junk does not contain the tag <TR>

Can anyone tell me what the re for this is? I can't seem to get anything
to work right now.

Regards,

Raoul-Sam
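
One possible pattern for the substitution described above, offered as a
sketch rather than a definitive answer. It assumes a current Python 3 and
the standard re module; the names MISSING_TR and add_missing_tr are made up
for illustration. The idea is to match a </TR> only when the next </TR>
can be reached without passing an opening <TR>, and to use a lookahead so
that the second </TR> is not consumed and can start the next match.

```python
import re

# A </TR> that is followed by another </TR> with no opening <TR ...> tag
# in between. The lookahead keeps the second </TR> unconsumed, so a run of
# several broken rows is repaired in a single pass.
MISSING_TR = re.compile(
    r"</TR\s*>(?=(?:(?!<TR\b).)*?</TR\s*>)",
    re.IGNORECASE | re.DOTALL,
)


def add_missing_tr(html):
    """Insert <TR> after every </TR> whose following row lacks one."""
    return MISSING_TR.sub(lambda m: m.group(0) + "<TR>", html)


sample = """<TABLE>
<TR>
<TD> foo</TD> <TD>bar</TD>
</TR>
<TD> 1</TD><TD>2</TD></TR>
<TD>3</TD><TD>4</TD></TR>
</TABLE>
"""

print(add_missing_tr(sample))
```

Rows that already open with <TR> are left alone, because the scan for the
next </TR> stops as soon as it meets an opening <TR> tag. The repaired text
can then be handed to the parser in place of the original file contents.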



