[issue13576] Handling of broken condcoms in HTMLParser

Ezio Melotti report at bugs.python.org
Sun Dec 11 02:58:35 CET 2011


New submission from Ezio Melotti <ezio.melotti at gmail.com>:

The attached patch adds a few tests for the handling of broken conditional comments (condcoms).
A valid condcom looks like <!--[if ie 6]>...<![endif]-->.
An invalid one looks like <![if ie 6]>...<![endif]>.
This seems to be a common mistake, found even on popular sites such as Adobe, LinkedIn, and deviantART.

Currently, HTMLParser calls unknown_decl(), passing e.g. 'if ie 6'. With strict=True an error is raised; with strict=False no error is raised and the unknown declaration is ignored.
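
To see which callbacks fire for a valid and a broken condcom, a small event collector can be used. This is only a sketch: the EventCollector class and the collect() helper are names made up for illustration, it uses the Python 3 html.parser module without passing strict, and which callback receives the broken condcom (unknown_decl() or handle_comment()) may depend on the Python version it is run on.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record the callbacks triggered while parsing (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def unknown_decl(self, data):
        # Overriding this hook records the declaration instead of
        # ignoring it.
        self.events.append(("unknown decl", data))

    def handle_data(self, data):
        self.events.append(("data", data))


def collect(source):
    """Parse `source` and return the list of recorded events."""
    parser = EventCollector()
    parser.feed(source)
    parser.close()
    return parser.events


# Valid condcom: the whole construct is a single comment.
print(collect("<!--[if ie 6]>...<![endif]-->"))
# Broken condcom: the opening and closing markers are reported
# separately, and the '...' in between is parsed as ordinary data.
print(collect("<![if ie 6]>...<![endif]>"))
```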

The HTML5 specs say:
"""
[After '<!',] If the next two characters are both U+002D HYPHEN-MINUS characters (-), consume those two characters, [...]
Otherwise, this is a parse error. Switch to the bogus comment state.[0]

[Once in the bogus comment state,] Consume every character up to and including the first U+003E GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever comes first. Emit a comment token whose data is the concatenation of all the characters starting from and including the character that caused the state machine to switch into the bogus comment state, up to and including the character immediately before the last consumed character (i.e. up to the character just before the U+003E or EOF character), but with any U+0000 NULL characters replaced by U+FFFD REPLACEMENT CHARACTER characters. (If the comment was started by the end of the file (EOF), the token is empty.)[1]
"""

So, IIUC, '<![if ie 6]>...<![endif]>' should emit a '[if ie 6]' comment, parse the '...' normally, and emit a '[endif]' comment.
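
For anyone who wants the HTML5 behavior without changing HTMLParser itself, one option is to forward unknown_decl() to handle_comment() in a subclass. The following is a minimal sketch, not the attached patch: BogusCommentParser and parse() are hypothetical names, and re-adding the square brackets assumes that unknown_decl() receives the text with exactly the leading '[' and trailing ']' stripped, which holds for condcom markers but not for everything that reaches unknown_decl() (e.g. CDATA sections).

```python
from html.parser import HTMLParser


class BogusCommentParser(HTMLParser):
    """Report '<![...]>' markers as comments (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.data = []

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        self.data.append(data)

    def unknown_decl(self, data):
        # Assumption: `data` is the marker text without its surrounding
        # brackets, so they are restored to match the comment token
        # described by the HTML5 spec.
        self.handle_comment("[" + data + "]")


def parse(source):
    """Return (comments, data chunks) found in `source`."""
    parser = BogusCommentParser()
    parser.feed(source)
    parser.close()
    return parser.comments, parser.data


comments, data = parse("<![if ie 6]>...<![endif]>")
print(comments)
print(data)
```

With this approach the condcom markers end up mixed in with all the other comments, so code that wants to treat them specially has to recognize them by their content.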

However, I think it's fine to leave the current behavior, for the following reasons:
  1) backward compatibility;
  2) handling broken condcoms in unknown_decl is easier than doing it in handle_comment, where all the other comments are sent;
  3) probably no one cares about them anyway.

[0]: http://www.w3.org/TR/html5/tokenization.html#markup-declaration-open-state
[1]: http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state

----------
assignee: ezio.melotti
components: Library (Lib)
files: issue13576.diff
keywords: patch
messages: 149204
nosy: eric.araujo, ezio.melotti
priority: normal
severity: normal
stage: commit review
status: open
title: Handling of broken condcoms in HTMLParser
type: behavior
versions: Python 2.7, Python 3.2, Python 3.3
Added file: http://bugs.python.org/file23909/issue13576.diff

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue13576>
_______________________________________

