[Python-Dev] HTMLParser and HTML5

Thu Jul 28 20:25:15 CEST 2011

Hello all,

I wanted to ask a few questions and start a discussion about HTML5
support within the HTMLParser class(es). Over on issue 670664, an
inconsistency with the way browsers and the HTMLParser parse script
and style tags was discovered. Currently, HTMLParser adheres strictly
to the HTML4 standard, which says that these tags should exit CDATA
mode when the start of *any* closing tag is found. No browsers, to my
knowledge, have ever supported this (at least in the 21st century).
Instead, all browsers implement the behavior described in the HTML5
spec, which states that script tags should exit their "raw text mode"
when the full closing tag for that element is encountered.

The repercussions of adhering to the HTML4 standard in HTMLParser are
somewhat serious: a good number of documents will either encounter
exceptions for broken markup (which aren't actually broken). Libraries
like Beautiful Soup (which depend on HTMLParser) are also affected,
requiring the use of hacks just to get the document to parse at all.

Rather than bore you all with another paragraph about how HTML4 is
terrible, feel free to look at the issue
(http://bugs.python.org/issue670664), which quite thoroughly outlines
the pros and cons of this particular change. Any feedback/input  on
the proposed changes is welcome.

So here are my questions:

- What plans, if any, are there to support HTML5 parsing behaviors,
since the HTML5 spec effectively describes current web browser
behavior?
- What policies are in place for keeping parity with other HTML
parsers (such as those in web browsers)?

Given the semi-backward-compatible nature of HTML5's syntax, this
seems like a rather unique problem that could use some more
discussion.

Thanks

Matt Basta