[issue6662] HTMLParser.HTMLParser doesn't handle malformed charrefs

Fri Aug 7 03:25:10 CEST 2009

New submission from Dave Day <dayveday at gmail.com>:

When HTMLParser.HTMLParser encounters a malformed charref (for example 
&#bad;) it no longer parsers the following HTML correctly.

For example:
  <p>&#bad;</p>
Recognises the starttag "p" but considers the rest to be data.

To reproduce:
class MyParser(HTMLParser.HTMLParser):
  def handle_starttag(self, tag, attrs):
    print 'Start "%s"' % tag
  def handle_endtag(self,tag):
    print 'End "%s"' % tag
  def handle_charref(self, ref):
    print 'Charref "%s"' % ref
  def handle_data(self, data):
    print 'Data "%s"' % data
parser = MyParser()
parser.feed('<p>&#bad;</p>')
parser.close()

Expected output:
Start "p"
Data "&#bad;"
End "p"

Actual output:
Start "p"
Data "&#bad;</p>"

----------
components: Library (Lib)
messages: 91392
nosy: dayveday
severity: normal
status: open
title: HTMLParser.HTMLParser doesn't handle malformed charrefs
type: behavior
versions: Python 2.4, Python 2.5

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue6662>
_______________________________________