HTMLParser bug?
Anand B Pillai
abpillai at lycos.com
Thu May 8 09:41:53 EDT 2003
I am developing a web spider in pure Python, using the
HTMLParser module from the Python standard distribution
(the stand-alone HTMLParser, not htmllib.HTMLParser).
I have found some bugs in this module.
Here is a very simple one.
For the following HTML data,
------------------------------------------------------------------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="content-type"
content="text/html; charset=ISO-8859-1">
<title>Test Page</title></head>
<body bgcolor=#FFFFFF>
<p>
<font face="Arial", size=5>Paragraph 1</font>
</p>
</body>
</html>
---------------------------------------------------------------------
HTMLParser gives the following error.
"malformed start tag, at line 8, column 19"
The parser stops parsing at this point. The error is caused by
the "," character inside the <font> tag. HTMLParser does not
expect a comma between attributes, so it treats the start tag
as malformed and raises the error.
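A minimal script to reproduce this might look like the sketch below. The names TagLogger and try_parse are made up for illustration and are not part of HTMLParser. The import fallback is an assumption for readers on newer Python versions, where the class lives in html.parser and may be more tolerant of this input, so the error may not appear there.

```python
try:
    from html.parser import HTMLParser  # newer Python versions
except ImportError:
    from HTMLParser import HTMLParser   # stand-alone module, as in this post

PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="content-type"
content="text/html; charset=ISO-8859-1">
<title>Test Page</title></head>
<body bgcolor=#FFFFFF>
<p>
<font face="Arial", size=5>Paragraph 1</font>
</p>
</body>
</html>
"""


class TagLogger(HTMLParser):
    """Hypothetical helper: records each start tag the parser reports."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def try_parse(data):
    """Return (start tags seen, error message or None)."""
    parser = TagLogger()
    try:
        parser.feed(data)
        parser.close()
    except Exception as exc:  # a parse error is raised by affected versions
        return parser.tags, str(exc)
    return parser.tags, None


tags, error = try_parse(PAGE)
print(tags)   # start tags reported before parsing stopped (or all of them)
print(error)  # the parser's error message, or None if none was raised
```

If the parser fails on the <font> tag, the list of tags ends before "font" and the second line shows the error message.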
Because of this, my spider program failed on many web pages.
So I have made the following modification to HTMLParser,
and it works.
    Line 269:  + if end not in (",", ">", "/>"):
    Line 297:  + if next == ',':
               +     return j + 1
This is a quick hack; I found that it works for this
example and a couple of other cases.
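For anyone who would rather not edit the library file, an alternative might be to clean the markup before feeding it to the parser. The sketch below is only an illustration of that idea, not the patch above: strip_stray_commas and both regular expressions are hypothetical names, and it assumes the stray comma always directly follows a closing quote and precedes whitespace. It can be fooled by a quoted attribute value that itself contains a quote followed by a comma.

```python
import re

# Matches an opening tag such as <font face="Arial", size=5>.
TAG_RE = re.compile(r"<[A-Za-z][^<>]*>")

# Matches a comma that directly follows a closing quote and precedes whitespace.
STRAY_COMMA_RE = re.compile(r"(?<=[\"']),(?=\s)")


def strip_stray_commas(html):
    """Remove stray commas between attributes; text outside tags is untouched."""
    def fix_tag(match):
        return STRAY_COMMA_RE.sub("", match.group(0))
    return TAG_RE.sub(fix_tag, html)


line = '<font face="Arial", size=5>Paragraph 1, continued</font>'
print(strip_stray_commas(line))
# prints: <font face="Arial" size=5>Paragraph 1, continued</font>
```

The cleaned string can then be passed to feed() as usual. The comma in the paragraph text is kept because only the contents of opening tags are rewritten.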
Regards,
Anand Pillai