Regular Expression Help for Python Newbie.
Fredrik Lundh
effbot at telia.com
Sat Apr 8 04:52:33 EDT 2000
Raoul-Sam Daruwala wrote:
> I have a problem. I wrote a python program to parse HTML files using the
> HTMLParser and all that I need to do with the files can be done very
> easily using this wonderful class. Kudos to the authors!
>
> My problem is that one of the sets of files that I'm trying to parse has
> badly formatted tables. Now when I do a view source on the files I can
> see the problem clearly. It's quite simple: the tables in these files
> start out properly formatted, but after a standard header the script
> that generates them leaves out the <TR> tag. This is incredible to me
> because both Netscape and IE can view the tables properly.
browsers treat <TR>'s as optional -- if a browser stumbles upon <TD>
in a <TABLE> context, it inserts the <TR> all by itself (strictly
speaking, the HTML 4 DTD only makes the </TR> end tag optional).
> What I need to do to fix this is run a quick pre-processor and, using the
> re module, replace all occurrences of
>
> </TR> junk </TR> with
> </TR><TR> junk </TR>
> where junk does not contain the tag <TR>
>
> Can anyone tell me what the re for this is? I can't seem to get anything
> to work right now.
if you cannot get it to work, how come you're so sure you
need to use the re module? ;-)
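for reference, the substitution described in the question could be sketched roughly like this. this is only an illustration, not a tested fix for the files in question: the names MISSING_TR and insert_missing_tr are made up here, and it assumes every row really does end in </TR> and that "junk" never contains a <TR> start tag.

```python
import re

# Hypothetical pattern: right after a </TR>, grab the shortest run of text
# that contains no <TR start tag and ends in the next </TR>.
MISSING_TR = re.compile(
    r"(?<=</TR>)((?:(?!<TR\b).)*?</TR>)",
    re.IGNORECASE | re.DOTALL,
)

def insert_missing_tr(html):
    # Put a <TR> in front of every row body that lacks one.
    return MISSING_TR.sub(r"<TR>\1", html)

if __name__ == "__main__":
    broken = "<TABLE><TR><TH>h</TH></TR><TD>a</TD></TR><TD>b</TD></TR></TABLE>"
    print(insert_missing_tr(broken))
```

the lookbehind keeps the preceding </TR> out of the match, so consecutive broken rows are each picked up. rows that already start with <TR> are left alone.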
try using the sgmllib parser instead; here's an example from the
eff-bot guide (see below). it shouldn't be that hard to tweak the
filter class to insert TR tags if it sees a TD following directly after
a TABLE.
# sgmllib-example-4.py

import sgmllib
import cgi, string, sys

class SGMLFilter(sgmllib.SGMLParser):
    # sgml filter.  override start/end to manipulate
    # document elements

    def __init__(self, outfile=None, infile=None):
        sgmllib.SGMLParser.__init__(self)
        if not outfile:
            outfile = sys.stdout
        self.write = outfile.write
        if infile:
            self.load(infile)

    def load(self, file):
        while 1:
            s = file.read(8192)
            if not s:
                break
            self.feed(s)
        self.close()

    def handle_entityref(self, name):
        self.write("&%s;" % name)

    def handle_data(self, data):
        self.write(cgi.escape(data))

    def unknown_starttag(self, tag, attrs):
        tag, attrs = self.start(tag, attrs)
        if tag:
            if not attrs:
                self.write("<%s>" % tag)
            else:
                self.write("<%s" % tag)
                for k, v in attrs:
                    self.write(" %s=%s" % (k, repr(v)))
                self.write(">")

    def unknown_endtag(self, tag):
        tag = self.end(tag)
        if tag:
            self.write("</%s>" % tag)

    def start(self, tag, attrs):
        return tag, attrs # override

    def end(self, tag):
        return tag # override

class Filter(SGMLFilter):

    def fixtag(self, tag):
        if tag == "em":
            tag = "i"
        if tag == "strong":
            tag = "b"
        return string.upper(tag)

    def start(self, tag, attrs):
        return self.fixtag(tag), attrs

    def end(self, tag):
        return self.fixtag(tag)

c = Filter()
c.load(open("samples/sample.htm"))
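
the TR tweak mentioned above might look something like the following. it's a sketch under a few assumptions: sgmllib is not available in Python 3, so this uses html.parser.HTMLParser as a stand-in, and the names RowFixer and fix_rows are invented for the example. it echoes the document and writes a <tr> whenever a td or th shows up inside a table with no open row. comments and declarations are dropped, and end tags come out in lowercase.

```python
from html.parser import HTMLParser

class RowFixer(HTMLParser):
    # Hypothetical filter: echoes the input, adding <tr> where a row
    # start tag is missing in front of a <td> or <th>.

    def __init__(self):
        # convert_charrefs=False keeps entities and character
        # references intact so they can be written back unchanged
        super().__init__(convert_charrefs=False)
        self.out = []
        self.row_open = []  # one flag per currently open <table>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.row_open.append(False)
        elif self.row_open:
            if tag == "tr":
                self.row_open[-1] = True
            elif tag in ("td", "th") and not self.row_open[-1]:
                self.out.append("<tr>")  # the missing row start
                self.row_open[-1] = True
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags are passed through untouched
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.row_open:
            if tag == "table":
                self.row_open.pop()
            elif tag == "tr":
                self.row_open[-1] = False
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

def fix_rows(html):
    parser = RowFixer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

if __name__ == "__main__":
    print(fix_rows("<table><tr><th>h</th></tr><td>a</td></tr></table>"))
```

keeping one flag per open table means a nested table inside a cell gets its own row state, and the outer row is still considered open when the inner table ends.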
</F>
<!-- (the eff-bot guide to) the standard python library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->