[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

Wed Jul 27 18:53:53 CEST 2011

Matt Basta <bastawhiz at gmail.com> added the comment:

> So I think the example is invalid (should escape the <), and that HTMLParser is not buggy.

On the other hand, the HTML5 spec clearly dictates otherwise:

http://www.w3.org/TR/html5/syntax.html#cdata-rcdata-restrictions
The text in raw text and RCDATA elements must not contain any occurrences of the string "</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters that case-insensitively match the tag name of the element followed by one of U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), U+0020 SPACE, U+003E GREATER-THAN SIGN (>), or U+002F SOLIDUS (/).
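To make the quoted rule concrete, here is a minimal sketch of how the end of a raw text element could be located under the HTML5 wording. The names `raw_text_end_pattern` and `find_raw_text_end` are made up for illustration; this is not HTMLParser's code, only one reading of the rule above.

```python
import re


def raw_text_end_pattern(tag_name):
    """Build a regex for the HTML5 terminator of a raw text element.

    Hypothetical helper: matches "</", then the tag name
    (case-insensitively), then one of tab, LF, FF, CR, space, ">" or "/".
    """
    return re.compile(
        r"</" + re.escape(tag_name) + r"[\t\n\f\r />]",
        re.IGNORECASE,
    )


def find_raw_text_end(text, tag_name, start=0):
    """Return the index of the terminating "</tag", or -1 if there is none."""
    match = raw_text_end_pattern(tag_name).search(text, start)
    return match.start() if match else -1


body = "<span></span>This should not be visible.</script>"
end = find_raw_text_end(body, "script")
# "</span>" does not terminate the element; only "</script>" does.
print(repr(body[:end]))
```

Under this reading, "</span>" and a split string such as "</scr" + "ipt>" are plain script text, while "</SCRIPT >" would still end the element.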

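Additionally, no current browser (except perhaps in quirks mode) obeys the HTML4 variant of the rule, under which any "</" followed by a letter ends the element. This is due in large part to the need to include strings such as "</scr" + "ipt>" within a script tag itself. The behavior can be observed firsthand by loading the snippet below in a browser: if the HTML4 rule applied, "</span>" would end the script element and the trailing text would be rendered.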

<script><span></span>This should not be visible.</script>
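The same snippet can also be fed to the parser itself. The sketch below assumes Python 3's `html.parser` module (not the Python 2 `HTMLParser.py` this issue was filed against) and a release that treats script content as raw text; `Recorder` and `parse_snippet` are hypothetical names used only for this check.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical helper: records start tags and character data."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.data = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        # Data may arrive in several chunks, so collect and join later.
        self.data.append(data)


def parse_snippet(html):
    """Parse html and return (start tags seen, concatenated text data)."""
    recorder = Recorder()
    recorder.feed(html)
    recorder.close()
    return recorder.start_tags, "".join(recorder.data)


tags, text = parse_snippet(
    "<script><span></span>This should not be visible.</script>"
)
print(tags)        # start tags the parser reported
print(repr(text))  # everything it treated as character data
```

If the parser follows the HTML5 rule, "span" should not appear among the start tags, and the whole script body should come back as data.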

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue670664>
_______________________________________