HTMLParser chokes on bad end tag in comment

Edward Elliott nobody at 127.0.0.1
Mon May 29 15:21:36 EDT 2006


Fredrik Lundh wrote:

>> Should it? The end tag it chokes on is in comment, isn't it?
> 
> no.  STYLE and SCRIPT elements contain character data, not parsed
> character data, so comments are treated as characters, and the first
> "</" ends the element.

Rather than take your word for it, I checked the W3C HTML 4 specification.
The passage is not in the DTD itself but in the appendix notes:

http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data

Element content 

When script or style data is the content of an element (SCRIPT and STYLE),
the data begins immediately after the element start tag and ends at the
first ETAGO ("</") delimiter followed by a name start character ([a-zA-Z]);
note that this may not be the element's end tag. Authors should therefore
escape "</" within the content. Escape mechanisms are specific to each
scripting or style sheet language.

ILLEGAL EXAMPLE:
The following script data incorrectly contains a "</" sequence (as part of
"</EM>") before the SCRIPT end tag:

    <SCRIPT type="text/javascript">
      document.write ("<EM>This won't work</EM>")
    </SCRIPT>

In JavaScript, this code can be expressed legally by hiding the ETAGO
delimiter before an SGML name start character:

    <SCRIPT type="text/javascript">
      document.write ("<EM>This will work<\/EM>")
    </SCRIPT>
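
To make the rule concrete, here is a rough Python sketch of it. This is only
my reading of the note quoted above, not what HTMLParser does internally, and
the names ETAGO_NAME_START, script_data_end and escape_etago are made up for
illustration. The escaping helper assumes every "</" in the input sits inside
a JavaScript string literal, which is the only place "<\/" is a safe rewrite.

```python
import re

# ETAGO ("</") followed by a name start character ([a-zA-Z]),
# as described in the HTML 4 note quoted above.
ETAGO_NAME_START = re.compile(r"</(?=[a-zA-Z])")


def script_data_end(text, start=0):
    """Return the index of the first "</" + letter at or after start, or -1.

    Under the quoted rule, SCRIPT/STYLE data ends here, whether or not
    the tag that follows is actually </SCRIPT> or </STYLE>.
    """
    match = ETAGO_NAME_START.search(text, start)
    return match.start() if match else -1


def escape_etago(js_source):
    """Rewrite "</" before a letter as "<\\/" to hide the ETAGO delimiter.

    Assumption: each match lies inside a JavaScript string literal.
    """
    # In the replacement template, "\\\\" yields one literal backslash.
    return ETAGO_NAME_START.sub(r"<\\/", js_source)


if __name__ == "__main__":
    bad = 'document.write ("<EM>This won\'t work</EM>")\n</SCRIPT>'
    # The data ends at "</EM>", well before the real end tag.
    print(bad[script_data_end(bad):])

    good = escape_etago('document.write ("<EM>This will work</EM>")')
    print(good)
```

With the first string, the data ends at "</EM>" rather than at "</SCRIPT>",
which is exactly the case the parser complains about.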


Guess you learn something new every day.  Too bad there's so much illegal
code in the wild. :(

-- 
Edward Elliott
UC Berkeley School of Law (Boalt Hall)
complangpython at eddeye dot net


