HTMLParser and Quotes

Thu Jan 2 15:43:04 EST 2003

Richard Brodie:
> HTMLParser is a fairly straightforward parser: it mostly follows the SGML
> syntax rules. That means that it is of little use for most of the HTML out on
> the web. Whilst a DWIM parser might be useful, it could get out of hand,
> and I'm fairly happy that the standard library one stops on the first error.
> In a few years the XML ones will error anyway.

In the meantime, you can use something like HTML Tidy
   http://tidy.sourceforge.net/
and Marc-André Lemburg's Python interface to it, mxTidy
   http://www.lemburg.com/files/python/mxTidy.html
to clean up the input HTML before parsing it, like this:

 >>> from mx import Tidy
 >>> from HTMLParser import HTMLParser
 >>> text = """<html>
... <body>
... <font face=arial,helvetica>test</font>
... </body>
... </html>"""
 >>>
 >>> print Tidy.Tidy.tidy(text)[2]
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<font face="arial,helvetica">test</font>
</body>
</html>

 >>>
 >>> x = HTMLParser()
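
Once the markup has been tidied, the parser has no trouble with the now-quoted attribute. Here is a minimal sketch of that second step. The TagCollector class and collect_tags function are hypothetical names of mine, not part of either library, and the TIDIED string is simply the Tidy output pasted from above, so mxTidy itself is not called. The import fallback is there only because the module was later renamed to html.parser.

```python
try:
    from html.parser import HTMLParser   # later (Python 3) module name
except ImportError:
    from HTMLParser import HTMLParser    # module name used in this post

# Cleaned markup pasted from the Tidy output above (mxTidy is not invoked here).
TIDIED = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<font face="arial,helvetica">test</font>
</body>
</html>
"""


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.tags.append((tag, attrs))


def collect_tags(html):
    """Feed html to a fresh TagCollector and return the recorded tags."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


tags = collect_tags(TIDIED)
```

The font element should show up in tags with a single ("face", "arial,helvetica") attribute pair, which is the value Tidy put in quotes.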

					Andrew
					dalke at dalkescientific.com