HTMLParser bug ?

Grzegorz Adam Hankiewicz gradha at titanium.sabren.com
Thu May 8 15:31:44 EDT 2003


On 2003-05-08, Anand B Pillai <abpillai at lycos.com> wrote:
> I am developing a web spider program in pure python.  I am using
> the HTMLParser module in the python standard distribution. (The
> stand-alone HTMLParser, not the htmllib.HTMLParser)
> 
> I have found some bugs with this module.  Here is a very simple
> one.  For the following html data, [...]

Running that HTML through the W3C validator gives:
"""
This page is not Valid HTML 4.01 Transitional!

   Below are the results of attempting to parse this document with
   an SGML parser.

    Line 8, column 18: character "," not allowed in attribute
    specification list (explain...).

   <font face="Arial", size=5>Paragraph 1</font>
"""

> HTMLParser gives the following error.  "malformed start tag, at
> line 8, column 19" [...]  This rendered many webpages faulty for
> my spider program.  So I have made the following modification in
> HTMLParser and it works.

Of course it does, but the HTML itself is malformed, so the best
fix is to correct the HTML. If you can't do that and still have to
process such pages, google for mxTidy. It's an HTML cleanup module
you can run on the input data before doing your own processing. So
far, HTMLParser has not given me any problems with `tidied' data.
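
To illustrate the "clean up first, then parse" idea, here is a rough
sketch. It does not use mxTidy. The clean_html() function is a
hypothetical stand-in that only handles the one defect shown above
(a stray comma after a quoted attribute value), and the TagCollector
and parse() names are made up for the example. It also assumes the
html.parser module of a current Python 3, which is more forgiving
than the HTMLParser module discussed in this thread.

```python
import re
from html.parser import HTMLParser

# Crude patterns: anything between '<' and '>' is treated as a tag, and
# a comma right after a double-quoted value inside a tag is "stray".
_TAG = re.compile(r'<[^<>]+>')
_STRAY_COMMA = re.compile(r'("[^"]*")\s*,')


def clean_html(data):
    """Hypothetical stand-in for a real cleanup tool such as mxTidy.

    Removes commas that directly follow a quoted attribute value,
    touching only text inside tags and leaving page text alone.
    """
    return _TAG.sub(lambda m: _STRAY_COMMA.sub(r'\1', m.group(0)), data)


class TagCollector(HTMLParser):
    """Records every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


def parse(data):
    """Clean the input first, then feed it to the parser."""
    parser = TagCollector()
    parser.feed(clean_html(data))
    parser.close()
    return parser.tags


if __name__ == "__main__":
    print(parse('<font face="Arial", size=5>Paragraph 1</font>'))
```

For a real spider, a proper cleanup tool is a much better choice
than a regular expression like this one, since broken pages fail in
far more ways than a single misplaced comma.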

-- 
 Please don't send me private copies of your public answers. Thanks.




