[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

Wed Jul 27 08:52:15 CEST 2011

Ezio Melotti <ezio.melotti at gmail.com> added the comment:

I left a review of your patch on rietveld, including a description of what I think is going on there (the patch lacks some context and it's not easy to figure out how everything works there).
I also did some tests with and without the patch:

>>> from HTMLParser import HTMLParser as HP
>>> class MyHP(HP):
...   def handle_data(self, data): print 'data: %r' % data
... 
>>> myhp = MyHP()

# without the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar'  # this looks ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo'  # where's the </p>?
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # some tags missing, 2 chunks received
data: 'bar'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo'
data: " '"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr'+'ipt>", at line 1, column 247

# with the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar'  # ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo' # all the content is there, but why 2 chunks?
data: '</p>'
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # same as previous
data: '</p>'
data: '<span>bar'
data: '</span>'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")  
data: '<p>foo' # same
data: '</p>'
data: " '"
data: "</scr'+'ipt>"
data: "' <span>bar"
data: '</span>'

So my question is: is it normal that the data is passed to handle_data in chunks?
AFAIU an HTML parser should see CDATA as a single chunk of bytes it doesn't care about, so the fact that further parsing happens on the content of script/style seems wrong to me.
If I'm reading the code correctly, that's because the "interesting" regex is set to look for any closing tag ('</') -- maybe assuming that the CDATA section doesn't contain any other tag (usually true in the case of <style>, often false for <script>).
Changing the regex to look explicitly for the matching closing tag (e.g. '</script>') might be better (but it would still fail for e.g. <script> document.write('<script>alert("foo")</script>')</script> -- but some browsers will fail with this too).
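
To make that last suggestion a bit more concrete, here is a minimal standalone sketch of an explicit end-tag search. It is not the code in HTMLParser.py nor the patch under review: cdata_end_pattern and split_cdata are hypothetical names used only for illustration, and matching case-insensitively with optional whitespace before '>' is an assumption on my side.

```python
import re

def cdata_end_pattern(tag):
    # Hypothetical helper: match only the end tag of the element that
    # opened the CDATA section, instead of any '</'.
    # Assumption: ignore case and allow whitespace before '>'.
    return re.compile(r'</%s\s*>' % re.escape(tag), re.IGNORECASE)

def split_cdata(rawdata, tag):
    # Return (content, rest): everything before the first matching end
    # tag, and everything after it. If no end tag is found, all of
    # rawdata is treated as content and rest is empty.
    match = cdata_end_pattern(tag).search(rawdata)
    if match is None:
        return rawdata, ''
    return rawdata[:match.start()], rawdata[match.end():]

# The input that raised HTMLParseError above: the split-up "</scr'+'ipt>"
# is not mistaken for an end tag, so the content comes back in one piece.
content, rest = split_cdata(
    "<p>foo</p> '</scr'+'ipt>' <span>bar</span></script>", 'script')
print(repr(content))

# The document.write() case still fails: the search stops at the first
# literal '</script>', which is inside the JavaScript string.
content, rest = split_cdata(
    " document.write('<script>alert(\"foo\")</script>')</script>", 'script')
print(repr(content))
print(repr(rest))
```

With something along these lines the whole script body could be passed to handle_data as a single chunk, at the cost of the known failure shown by the second call.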

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue670664>
_______________________________________