HTMLParser bug?

Anand B Pillai abpillai at lycos.com
Thu May 8 09:41:53 EDT 2003


I am developing a web spider program in pure Python.
I am using the HTMLParser module from the Python standard
library (the stand-alone HTMLParser, not htmllib.HTMLParser).

I have found some bugs in this module.
Here is a very simple one.
Given the following HTML data:

------------------------------------------------------------------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="content-type"
 content="text/html; charset=ISO-8859-1">
<title>Test Page</title></head>
<body bgcolor=#FFFFFF>
<p>
<font face="Arial", size=5>Paragraph 1</font>
</p>
</body>
</html>
---------------------------------------------------------------------

HTMLParser reports the following error:
"malformed start tag, at line 8, column 19"

The parser stops parsing at this point. The error is caused by
the "," character inside the <font> tag. HTMLParser does not
appear to accept a comma between attributes, so it raises the error.
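
The test above can be reproduced with a short script. This is only a
minimal sketch: it assumes Python 3, where the module is imported as
html.parser rather than HTMLParser, and the names TagLogger and
parse_sample are hypothetical helpers, not part of the library. As far
as I know, recent Python 3 releases tolerate this markup instead of
raising an error, so the script also reports whether any exception
occurred.

```python
from html.parser import HTMLParser  # Python 3 name (assumption); the post used the Python 2 "HTMLParser" module

SAMPLE = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="content-type"
 content="text/html; charset=ISO-8859-1">
<title>Test Page</title></head>
<body bgcolor=#FFFFFF>
<p>
<font face="Arial", size=5>Paragraph 1</font>
</p>
</body>
</html>
'''


class TagLogger(HTMLParser):
    """Hypothetical helper: records every start tag with its position."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        line, col = self.getpos()  # position of the tag being handled
        self.seen.append((tag, line, col, attrs))


def parse_sample(text=SAMPLE):
    """Feed the text to the parser; return (start tags seen, exception or None)."""
    parser = TagLogger()
    try:
        parser.feed(text)
        parser.close()
    except Exception as exc:  # a strict parser would fail at the <font> tag
        return parser.seen, exc
    return parser.seen, None


if __name__ == "__main__":
    seen, error = parse_sample()
    for tag, line, col, attrs in seen:
        print(line, col, tag, attrs)
    print("error:", error)
```

If the parser stops at the <font> tag, the list of recorded tags ends
before it and the exception is returned alongside, which makes it easy
to see how far parsing got.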

This caused my spider program to fail on many web pages,
so I made the following modification to HTMLParser,
and it works.

Line 269:  + if end not in (",", ">", "/>"):
Line 297:  + if next == ',':
           +     return j + 1

This is a quick hack, but I found that it works for this
example and a couple of other cases.
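
For anyone who would rather not modify the library file, the same
markup can also be cleaned up before it reaches the parser. The sketch
below is a different technique from the patch above, not a replacement
for it: it assumes the stray comma always directly follows a quoted
attribute value (as in the <font> example), and strip_stray_commas is a
hypothetical name chosen for illustration.

```python
import re

# One whole start tag. Quoted values are consumed as units so that a ">"
# inside quotes does not end the match early.
_START_TAG = re.compile(r'<[a-zA-Z](?:"[^"]*"|\'[^\']*\'|[^\'">])*>')

# A quoted value, optionally followed by a stray comma and whitespace.
_QUOTED = re.compile(r'("[^"]*"|\'[^\']*\')(\s*,\s*)?')


def _fix_tag(match):
    """Drop a comma that directly follows a quoted value inside one tag."""
    def fix_value(q):
        # Keep the quoted value; replace ", " after it with a single space.
        return q.group(1) + (" " if q.group(2) else "")
    return _QUOTED.sub(fix_value, match.group(0))


def strip_stray_commas(html):
    """Return html with commas after quoted attribute values removed.

    Only text inside start tags is touched; commas in ordinary page text
    and commas inside quoted values are left alone.
    """
    return _START_TAG.sub(_fix_tag, html)


if __name__ == "__main__":
    print(strip_stray_commas('<font face="Arial", size=5>Paragraph 1</font>'))
```

The cleaned string can then be passed to the parser's feed() method as
usual. Unquoted values followed by a comma (for example size=5,) are
not handled by this sketch.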

Regards,

Anand Pillai



