Found a parsing bug in HTMLParser

Bengt Richter bokr at oz.net
Sun Feb 9 16:38:36 EST 2003


On Sun, 9 Feb 2003 18:06:56 +0100, Grzegorz Adam Hankiewicz <gradha at terra.es> wrote:

>Hi.
>
>I've found a bug in HTMLParser parsing some of my webpages. The
>problem is using an attribute with a value inside double quotes
>which is near another attribute. I've created a small testcase
Too "near" to be legal HTML 4.0, I believe. From the spec:
(http://www.w3.org/TR/1998/REC-html40-19980424)
"""
3.2.2 Attributes

Elements may have associated properties, called attributes, which may have values
(by default, or set by authors or scripts). Attribute/value pairs appear before
the final ">" of an element's start tag. Any number of (legal) attribute value pairs,
separated by spaces, may appear in an element's start tag. They may appear in any order.
^^^^^^^^^^^^^^^^^^^
"""
Your DOCTYPE declares HTML 4.0, but even if the page is trying to do new XHTML stuff,
XML requires white space before each attribute, i.e.,
from my copy of the XML spec at http://www.w3.org/TR/1998/REC-xml-19980210

    STag ::= '<' Name (S Attribute)* S? '>'
where
    S    ::=  (#x20 | #x9 | #xD | #xA)+

so it surprises me that you get an ok validation, though I'm not surprised
that browsers ignore anomalies.
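
For what it's worth, the two productions above can be turned into a rough
regular-expression check. This is only a sketch for a current Python 3
(re.fullmatch does not exist in 2.2). NAME and ATT_VALUE are simplified
stand-ins of my own for the spec's Name and AttValue productions, and
is_wellformed_start_tag is a hypothetical helper, not anything in HTMLParser.

```python
import re

# S ::= (#x20 | #x9 | #xD | #xA)+
S = r"[\x20\x09\x0D\x0A]+"
# Assumption: an ASCII-only approximation of the spec's Name production.
NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"
# Assumption: quoted values only, without entity-reference checking.
ATT_VALUE = r'''(?:"[^<"]*"|'[^<']*')'''
# Attribute ::= Name Eq AttValue, with Eq ::= S? '=' S?
ATTRIBUTE = NAME + "(?:" + S + ")?=(?:" + S + ")?" + ATT_VALUE
# STag ::= '<' Name (S Attribute)* S? '>'
STAG = re.compile("<" + NAME + "(?:" + S + ATTRIBUTE + ")*(?:" + S + ")?>")


def is_wellformed_start_tag(tag):
    """Return True if the whole string matches the simplified STag pattern."""
    return STAG.fullmatch(tag) is not None
```

With these definitions, the tag from test.html is rejected, and the same tag
with a space in front of title is accepted.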

>which you can see below. The w3c validator says the page is ok
>(http://validator.w3.org/check?uri=http://www.terra.es/personal7/gradha/test.html),
>and browsers render it without problems.  Does it happen with newer
>Python versions? What's the procedure for bug reports?
>
>PD: Don't CC me your replies.
>
>$ cat test.html
><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
><html><head><title>t</title>
><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
></head><body>
><a href="http://ss"title="pe">P</a>
                    ^^^^^^^^^^ -- need white space in front of this, e.g.,
 <a href="http://ss" title="pe">P</a>
></body></html>
>
>$ python
>Python 2.2.1 (#1, Apr 21 2002, 08:38:44)
>[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
>Type "help", "copyright", "credits" or "license" for more information.
>>>> from HTMLParser import HTMLParser
>>>> p = HTMLParser()
>>>> file = open("test.html", "rt")
>>>> p.feed("".join(file.readlines()))
>>>> file.close()
>>>> p.close()
>Traceback (most recent call last):
>  File "<stdin>", line 1, in ?
>  File "/usr/lib/python2.2/HTMLParser.py", line 112, in close
>    self.goahead(1)
>  File "/usr/lib/python2.2/HTMLParser.py", line 166, in goahead
>    self.error("EOF in middle of construct")
>  File "/usr/lib/python2.2/HTMLParser.py", line 115, in error
>    raise HTMLParseError(message, self.getpos())
>HTMLParser.HTMLParseError: EOF in middle of construct, at line 5, column 1
>
Seems like a better message could have been generated, though.
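
As a rough idea of what a more helpful diagnostic could point at, here is a
sketch (current Python 3, not the 2.2 HTMLParser code) that scans a document
for a quoted attribute value immediately followed by another name and reports
where it happens. START_TAG, GLUED and find_glued_attributes are hypothetical
names of mine, and the tag matcher is a heuristic that ignores ">" inside
quoted values.

```python
import re

# Heuristic start-tag matcher: '<' plus a letter, up to the next '>'.
START_TAG = re.compile(r"<[A-Za-z][^>]*>")
# A quoted value whose closing quote is directly followed by a name character.
GLUED = re.compile(r'''=\s*(?:"[^"]*"|'[^']*')(?=[A-Za-z_:])''')


def find_glued_attributes(html):
    """Yield (line, column) for each attribute with no white space before it.

    Lines are counted from 1 and columns from 0.
    """
    for tag in START_TAG.finditer(html):
        for hit in GLUED.finditer(tag.group()):
            # Absolute offset of the first character of the glued attribute.
            offset = tag.start() + hit.end()
            line = html.count("\n", 0, offset) + 1
            column = offset - (html.rfind("\n", 0, offset) + 1)
            yield line, column


if __name__ == "__main__":
    sample = '<html>\n<a href="http://ss"title="pe">P</a>\n</html>\n'
    for line, column in find_glued_attributes(sample):
        print("missing white space before attribute at line %d, column %d"
              % (line, column))
```

A position like that, pointing at the attribute itself, would say more than
"EOF in middle of construct".
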
Regards,
Bengt Richter



