HTMLParser and Quotes

Richard West rwest2 at opti.cgi.net
Thu Jan 2 16:06:57 EST 2003


Thank you!  Fortunately my app is not high volume.  This should do
nicely.

-Richard


On Thu, 02 Jan 2003 13:43:04 -0700, Andrew Dalke
<adalke at mindspring.com> wrote:

>Richard Brodie:
>> HTMLParser is a fairly straightforward parser: it mostly follows the SGML
>> syntax rules. That means that it is of little use for most of the HTML out on
>> the web. Whilst a DWIM parser might be useful, it could get out of hand,
>> and I'm fairly happy that the standard library one stops on the first error.
>> In a few years the XML ones will error anyway.
>
>In the meantime, you can use something like HTML Tidy
>   http://tidy.sourceforge.net/
>and Marc-André Lemburg's Python interface to it, mxTidy
>   http://www.lemburg.com/files/python/mxTidy.html
>to clean up the input HTML before parsing it, like this:
>
> >>> from mx import Tidy
> >>> from HTMLParser import HTMLParser
> >>> text = """<html>
>... <body>
>... <font face=arial,helvetica>test</font>
>... </body>
>... </html>"""
> >>>
> >>> print Tidy.Tidy.tidy(text)[2]
><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
><html>
><head>
><title></title>
></head>
><body>
><font face="arial,helvetica">test</font>
></body>
></html>
>
> >>>
> >>> x = HTMLParser()
>
>					Andrew
>					dalke at dalkescientific.com
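
For anyone following along, here is a rough sketch of the parsing step
that comes after the cleanup. It is written against Python 3, where the
parser lives in html.parser rather than the HTMLParser module used
above. The TagCollector class is a hypothetical helper made up for this
illustration, and the tidied markup is pasted in from the Tidy output
above rather than produced by calling mxTidy, so treat it as one
possible way to do it, not as the original poster's code.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class


class TagCollector(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs with quotes removed
        self.tags.append((tag, attrs))


# Tidied markup copied from the Tidy output above (assumption: mxTidy is
# not called here, the string is simply pasted in).
tidied = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<font face="arial,helvetica">test</font>
</body>
</html>
"""

collector = TagCollector()
collector.feed(tidied)
collector.close()

# Pull out the attributes of every <font> tag as dictionaries.
font_attrs = [dict(attrs) for tag, attrs in collector.tags if tag == "font"]
print(font_attrs)  # [{'face': 'arial,helvetica'}]
```

Because Tidy has already quoted the face attribute, the parser sees a
single value containing the comma instead of stumbling over it.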




