[Python-Dev] HTMLParser and HTML5

Thu Jul 28 23:49:14 CEST 2011

On Thu, Jul 28, 2011 at 11:25, Matt <mattbasta at gmail.com> wrote:

> Hello all,
>
> I wanted to ask a few questions and start a discussion about HTML5
> support within the HTMLParser class(es). Over on issue 670664, an
> inconsistency with the way browsers and the HTMLParser parse script
> and style tags was discovered. Currently, HTMLParser adheres strictly
> to the HTML4 standard, which says that these tags should exit CDATA
> mode when the start of *any* closing tag is found. No browsers, to my
> knowledge, have ever supported this (at least in the 21st century).
> Instead, all browsers implement the behavior described in the HTML5
> spec, which states that script tags should exit their "raw text mode"
> when the full closing tag for that element is encountered.
>
> The repercussions of adhering to the HTML4 standard in HTMLParser are
> somewhat serious: a good number of documents will encounter
> exceptions for broken markup (which isn't actually broken). Libraries
> like Beautiful Soup (which depend on HTMLParser) are also affected,
> requiring the use of hacks just to get the document to parse at all.
>
> Rather than bore you all with another paragraph about how HTML4 is
> terrible, feel free to look at the issue
> (http://bugs.python.org/issue670664), which quite thoroughly outlines
> the pros and cons of this particular change. Any feedback/input on
> the proposed changes is welcome.
>
> So here are my questions:
>
> - What plans, if any, are there to support HTML5 parsing behaviors,
> since the HTML5 spec effectively describes current web browser
> behavior?
>

There are no specific plans that have been publicly brought up (to my
knowledge).

> - What policies are in place for keeping parity with other HTML
> parsers (such as those in web browsers)?
>

There aren't any beyond "it would be nice".

>
> Given the semi-backward-compatible nature of HTML5's syntax, this
> seems like a rather unique problem that could use some more
> discussion.
>

It's more of an issue of someone caring enough to do the coding work to
bring the parser up to spec for HTML5 (or introduce new code to live beside
the HTML4 parsing code). IOW there are no policies specifically about this
topic beyond the general desire to stay up-to-date with stable specs.
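
For readers following along, here is a rough sketch of the two
end-of-content rules described in the quoted message. It is not the
HTMLParser implementation and not a patch from the issue; the names
HTML4_END, html5_end and raw_text are made up for illustration, and the
regular expressions are only an approximation of what each spec says.

```python
import re

# HTML4 rule (approximation): CDATA content ends at the first "</"
# that is followed by a letter, i.e. the start of *any* closing tag.
HTML4_END = re.compile(r"</[a-zA-Z]")


def html5_end(tag):
    """HTML5 rule (approximation): raw text ends only at "</tag"
    followed by whitespace, "/" or ">". Hypothetical helper."""
    return re.compile(r"</%s(?=[\t\n\f />])" % re.escape(tag), re.IGNORECASE)


def raw_text(source, end_pattern):
    """Return the element content up to the first match of end_pattern."""
    match = end_pattern.search(source)
    return source if match is None else source[:match.start()]


# Text that follows an opening <script> tag in some document.
body = 'document.write("<b>hi</b>");</script><p>after</p>'

html4_content = raw_text(body, HTML4_END)
html5_content = raw_text(body, html5_end("script"))

print(repr(html4_content))
print(repr(html5_content))
```

Under the HTML4 rule the script content is cut off at the "</b>" inside
the string literal, while under the HTML5 rule it runs up to the real
"</script>" tag, which matches what browsers do.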