[New-bugs-announce] [issue26009] HTMLParser lacking a few features to reconstruct input exactly

Mon Jan 4 12:35:38 EST 2016

New submission from Jason Sachs:

The HTMLParser class (https://docs.python.org/2/library/htmlparser.html) is lacking a few features to reconstruct input exactly. For the most part it can do this, but I found two items where it falls short (there may be others):

- There is a get_starttag_text() method but no get_endtag_text() method, which is necessary if the end tag is not in canonical form, e.g. instead of </p> it is </P> or </   P >

- The effect of the parse_bogus_comment() internal method is to call handle_comment(), so content like <! I AM BOGUS > cannot be distinguished by subclasses of HTMLParser from actual comments <!-- I AM BOGUS -->

Suggested changes:

- Add a get_endtag_text() method to return the exact endtag text
- change parse_bogus_comment to call self.handle_bogus_comment(), and define self.handle_bogus_comment() to call self.handle_comment(). This way it is backwards-compatible with existing behavior, but subclasses can redefine self.handle_bogus_comment() to do what they want.

----------
messages: 257472
nosy: jason_s
priority: normal
severity: normal
status: open
title: HTMLParser lacking a few features to reconstruct input exactly
type: behavior
versions: Python 2.7

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue26009>
_______________________________________