Regular Expression Help for Python Newbie.

Fredrik Lundh effbot at telia.com
Sat Apr 8 04:52:33 EDT 2000


Raoul-Sam Daruwala wrote:
> I have a problem. I wrote a python program to parse HTML files using the
> HTMLParser and all that I need to do with the files can be done very
> easily using this wonderful class. Kudos to the authors!
>
> My problem is that one of the sets of files that I'm trying to parse has
> badly formatted tables. Now when I do a view source on the files I can
> see the problem clearly. It's quite simple: the tables in these files
> start out properly formatted, but after a standard header the script
> that generates them leaves out the <TR> tag. This is incredible to me
> because both Netscape and IE can view the tables properly.

<TR>'s are optional -- if the browser stumbles upon <TD>
in a <TABLE> context, it should insert <TR>'s all by itself.

> What I need to do to fix this is run a quick pre-processor and, using
> the re module, replace all occurrences of
>
>     </TR> junk </TR> with
>     </TR><TR> junk </TR>
> where junk does not contain the tag <TR>
>
> Can anyone tell me what the re for this is? I can't seem to get anything
> to work right now.

if you cannot get it to work, how come you're so sure you
need to use the re module? ;-)
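
for the record, the substitution described above can probably be
written with a lookahead. this is only a sketch: the names ROW_GAP and
add_missing_tr are made up for illustration, the re.IGNORECASE and
re.DOTALL flags need a newer Python than the one current in 2000, and
it assumes "junk" means "anything that does not contain a <TR> start
tag". it knows nothing about comments or about a first row that lacks
a header, so treat it as fragile.

```python
import re

# match a </TR> that is followed, before any <TR> start tag shows up,
# by another </TR>.  the lookahead consumes nothing, so consecutive
# broken rows are all found in a single pass.
ROW_GAP = re.compile(
    r"</TR>(?=(?:(?!<TR\b).)*?</TR>)",
    re.IGNORECASE | re.DOTALL,
)

def add_missing_tr(text):
    # keep the matched end tag as it was and append the missing start tag
    return ROW_GAP.sub(r"\g<0><TR>", text)

broken = "<TABLE><TR><TH>h</TH></TR><TD>a</TD></TR><TD>b</TD></TR></TABLE>"
print(add_missing_tr(broken))
```

rows that already have their <TR> are left alone, since the lookahead
gives up as soon as it runs into a <TR> start tag.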

try using the sgmllib parser instead; here's an example from the
eff-bot guide (see below).  it shouldn't be that hard to tweak the
filter class to insert TR tags if it sees a TD following directly after
a TABLE.

# sgmllib-example-4.py

import sgmllib
import cgi, string, sys

class SGMLFilter(sgmllib.SGMLParser):
    # sgml filter.  override start/end to manipulate
    # document elements

    def __init__(self, outfile=None, infile=None):
        sgmllib.SGMLParser.__init__(self)
        if not outfile:
            outfile = sys.stdout
        self.write = outfile.write
        if infile:
            self.load(infile)

    def load(self, file):
        while 1:
            s = file.read(8192)
            if not s:
                break
            self.feed(s)
        self.close()

    def handle_entityref(self, name):
        self.write("&%s;" % name)

    def handle_data(self, data):
        self.write(cgi.escape(data))

    def unknown_starttag(self, tag, attrs):
        tag, attrs = self.start(tag, attrs)
        if tag:
            if not attrs:
                self.write("<%s>" % tag)
            else:
                self.write("<%s" % tag)
                for k, v in attrs:
                    self.write(" %s=%s" % (k, repr(v)))
                self.write(">")

    def unknown_endtag(self, tag):
        tag = self.end(tag)
        if tag:
            self.write("</%s>" % tag)

    def start(self, tag, attrs):
        return tag, attrs # override

    def end(self, tag):
        return tag # override

class Filter(SGMLFilter):

    def fixtag(self, tag):
        if tag == "em":
            tag = "i"
        if tag == "strong":
            tag = "b"
        return string.upper(tag)

    def start(self, tag, attrs):
        return self.fixtag(tag), attrs

    def end(self, tag):
        return self.fixtag(tag)

c = Filter()
c.load(open("samples/sample.htm"))
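
the TR tweak mentioned above might look something like the following.
it is a sketch under a few assumptions, not a drop-in fix: it uses
html.parser from Python 3 (sgmllib was removed in Python 3.0), the
names RowFixer and fix_rows are made up for illustration, and it
inserts a <tr> whenever a cell turns up in a table that has no open
row, which is slightly broader than "a TD directly after a TABLE".
tag names come out lowercased, and comments and declarations are
dropped.

```python
from html import escape
from html.parser import HTMLParser
from io import StringIO


class RowFixer(HTMLParser):
    # hypothetical filter: copies the document to outfile, adding a
    # <tr> whenever a cell shows up inside a table with no open row

    def __init__(self, outfile):
        # convert_charrefs=False reports entities as separate events,
        # so they can be written back unchanged
        HTMLParser.__init__(self, convert_charrefs=False)
        self.write = outfile.write
        self.rows = []  # one "row is open" flag per open table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.rows.append(False)
        elif self.rows:
            if tag == "tr":
                self.rows[-1] = True
            elif tag in ("td", "th") and not self.rows[-1]:
                self.write("<tr>")  # the missing start tag
                self.rows[-1] = True
        self.write("<" + tag)
        for name, value in attrs:
            if value is None:
                self.write(" " + name)
            else:
                # attribute values arrive unescaped, so escape them again
                self.write(' %s="%s"' % (name, escape(value)))
        self.write(">")

    def handle_endtag(self, tag):
        if self.rows:
            if tag == "tr":
                self.rows[-1] = False
            elif tag == "table":
                self.rows.pop()
        self.write("</%s>" % tag)

    def handle_data(self, data):
        self.write(data)  # raw text between tags, passed through as is

    def handle_entityref(self, name):
        self.write("&%s;" % name)

    def handle_charref(self, name):
        self.write("&#%s;" % name)


def fix_rows(text):
    # convenience wrapper: run the filter over a string
    out = StringIO()
    parser = RowFixer(out)
    parser.feed(text)
    parser.close()
    return out.getvalue()


print(fix_rows("<TABLE><TR><TH>h</TH></TR><TD>a</TD></TR></TABLE>"))
```

keeping one flag per open table means a table nested inside a cell
does not confuse the row state of the table around it.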

</F>

<!-- (the eff-bot guide to) the standard python library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->




