[python-committers] The evolution of HTMLParser

Wed Nov 20 20:57:26 CET 2013

Hi,
over the last few months/years I have been working on HTMLParser and
the html package -- this is a brief summary of what happened, what I'm
working on for 3.4, and what the plans for the future are.  I'm
looking for some feedback about the latter.

When I first looked at it, the parser was able to parse most valid
pages and had some heuristics to work around common mistakes, but it
raised exceptions on most broken pages.  Python 3.2 also added a new
strict argument that, when set to False, allows the parser to try to
handle some more broken markup.

Last year HTML5 became a "Candidate Recommendation", and a lengthy
specification with detailed algorithms to handle both valid and
invalid markup was released [0].  Since then I have been working on
aligning HTMLParser with the HTML5 standard, while trying to remain
backward compatible and, where necessary, provide warnings for the
changes I was making.
Since the HTML5 standard specifies how to handle broken markup and
since the strict mode of HTMLParser is not strict enough to be used to
validate markup, I decided to deprecate it in 3.3 and remove it in 3.5
[2].
Currently the parser is able to handle horribly broken markup and (in
theory) should never raise errors while parsing HTML.  The result it
produces is really close to what the standard says and what browsers
do (I intentionally ignore a few obscure corner cases to keep the
code relatively simple/fast).  This is true for both 2.7 and
3.x (you can try to break it and report any failures you might
encounter).
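
As a rough illustration of "should never raise", here is a minimal
sketch for 3.x (on 2.7 the module is named HTMLParser instead of
html.parser).  The EventCollector class, the collect() helper and the
sample markup are made up for this example; only HTMLParser and its
handle_* methods come from the standard library.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records the parser callbacks it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def collect(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    return parser.events


# Unclosed <p> and <b>, a stray </i>, an unquoted attribute value,
# and no closing tags at the end of the input.
broken = "<p>unclosed <b>bold</i> <a href=foo>link"
events = collect(broken)
for event in events:
    print(event)
```

The parser just reports what it finds (including the stray end tag);
it does not try to repair the tree the way a browser would.
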
Python 3.3 also comes with the list of HTML 5 entities
(html.entities.html5), and 3.4 will have an html.unescape() function
to convert them to the corresponding Unicode characters.
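
For example, assuming Python 3.4 or later (where html.unescape() is
available), a small sketch of how the entity table and the function
fit together; the sample strings are invented for illustration:

```python
import html
from html.entities import html5

# html5 maps entity names to the characters they stand for.
# Names normally include the trailing ';'.
amp = html5["amp;"]
eacute = html5["eacute;"]

# html.unescape() converts named, decimal and hexadecimal character
# references in a string to the corresponding Unicode characters.
text = html.unescape("&lt;b&gt; caf&eacute; &#34;quoted&#34; &#x263A;")

print(repr(amp), repr(eacute))
print(text)
```
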

Now I'm working on #13633 (Automatically convert character references
in HTMLParser [1]), and I'm planning to add a convert_charrefs boolean
flag to the constructors that, when set to True, will automatically
convert charrefs (e.g. "&quot;", "&#34;") to the corresponding Unicode
characters, and avoid calling the handle_charref/handle_entityref
methods.
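
To make the difference concrete, here is a sketch that assumes the
flag lands with the name and semantics described above (i.e. Python
3.4 or later).  The Recorder class and the record() helper are
hypothetical, written only for this example.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical helper: records text-related callbacks."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.calls = []

    def handle_data(self, data):
        self.calls.append(("data", data))

    def handle_entityref(self, name):
        self.calls.append(("entityref", name))

    def handle_charref(self, name):
        self.calls.append(("charref", name))


def record(markup, convert_charrefs):
    """Parse markup and return the recorded callbacks."""
    parser = Recorder(convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.calls


markup = "<p>&quot;hi&#34;</p>"

# With the flag on, references arrive already converted via handle_data().
converted = record(markup, convert_charrefs=True)

# With the flag off, handle_entityref()/handle_charref() are called and
# the subclass has to do the conversion itself.
raw = record(markup, convert_charrefs=False)

print(converted)
print(raw)
```

With the flag on, a subclass that only cares about text needs to
override handle_data() and nothing else.
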
Since in my opinion this behavior is preferable, I am thinking about
switching the default to True in 3.5/3.6 and adding a warning to 3.4
that tells users to either set convert_charrefs explicitly or be
ready for a behavior change in 3.5/3.6.  This means that HTMLParser
users will see the warning and will have to set the flag; they will be
able to remove it again in 3.5/3.6, when the default will be True and
the warning will be gone.
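
One possible way to implement such a transition is a sentinel default,
so that "not passed" can be told apart from an explicit False.  This
is only a sketch of the idea, not the actual patch: SketchParser,
_UNSET, warnings_for() and the warning text are all made up here.

```python
import warnings

# Sentinel: lets us distinguish "argument omitted" from an explicit False.
_UNSET = object()


class SketchParser:
    """Hypothetical stand-in for HTMLParser; only flag handling is shown."""

    def __init__(self, convert_charrefs=_UNSET):
        if convert_charrefs is _UNSET:
            warnings.warn(
                "The default value of convert_charrefs will change to "
                "True; pass it explicitly to keep the current behavior.",
                DeprecationWarning,
                stacklevel=2,
            )
            convert_charrefs = False  # old default during the transition
        self.convert_charrefs = convert_charrefs


def warnings_for(**kwargs):
    """Return the warning categories raised when building a SketchParser."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        SketchParser(**kwargs)
    return [w.category for w in caught]


print(warnings_for())                        # flag omitted
print(warnings_for(convert_charrefs=True))   # explicit True
print(warnings_for(convert_charrefs=False))  # explicit False
```

Users who pass the flag explicitly (either value) never see the
warning, so the later change of default does not affect them.
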

Do you think this would be acceptable?  If not, can you think of a
better way to do it?

After this, most of the work on HTMLParser and the html package should
be done.  I plan to update the documentation and say that the parser
is (almost) compliant with HTML 5, phase out the deprecated "strict"
argument [2] and eventually the warning about convert_charrefs, and
possibly do some optimizations/cleanups.  There's also an open issue
to add a generator-based API [3], but that's a major change and I need
more time to think about it.

Best Regards,
Ezio Melotti

[0]: http://www.w3.org/TR/html5/syntax.html
[1]: http://bugs.python.org/issue13633 - Automatically convert
character references in HTMLParser
[2]: http://bugs.python.org/issue15114 - Deprecate strict mode of HTMLParser
[3]: http://bugs.python.org/issue17410 - Generator-based HTMLParser