[issue670664] HTMLParser.py - more robust SCRIPT tag parsing
Ezio Melotti
report at bugs.python.org
Wed Jul 27 08:52:15 CEST 2011
Ezio Melotti <ezio.melotti at gmail.com> added the comment:
I left a review about your patch on rietveld, including a description of what I think it's going on there (the patch lacks some context and it's not easy to figure out how everything works there).
I also did some tests with and without the patch:
>>> from HTMLParser import HTMLParser as HP
>>> class MyHP(HP):
... def handle_data(self, data): print 'data: %r' % data
...
>>> myhp = MyHP()
# without the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar' # this looks ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo' # where's the </p>?
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # some tags missing, 2 chunks received
data: 'bar'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo'
data: " '"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
k = self.parse_endtag(i)
File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
self.error("bad end tag: %r" % (rawdata[i:j],))
File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr'+'ipt>", at line 1, column 247
# with the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar' # ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo' # all the content is there, but why 2 chunks?
data: '</p>'
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # same as previous
data: '</p>'
data: '<span>bar'
data: '</span>'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo' # same
data: '</p>'
data: " '"
data: "</scr'+'ipt>"
data: "' <span>bar"
data: '</span>'
So my question is: is it normal that the data is passed to handle_data in chunks?
AFAIU HTML parser should see CDATA as a single chunk of bytes they don't care about, so the fact that further parsing happens on the content of script/style seems wrong to me.
If I'm reading the code correctly that's because the "interesting" regex is set to look for a closing tag ('</') -- maybe assuming that the CDATA section doesn't contain any other tag (usually true in case of <style>, often false for <script>).
Changing the regex to explicitly look for the closing tag might be better (but still fail for e.g. <script> document.write('<script>alert("foo")</script>')</script> -- but some browsers will fail with this too).
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue670664>
_______________________________________
More information about the Python-bugs-list
mailing list