[issue5498] Can SGMLParser properly handle <empty/> tags?

Éric Araujo report at bugs.python.org
Wed Jan 27 14:40:38 CET 2010


Éric Araujo <merwok at netwok.org> added the comment:

Hello

XML-style tags of the form <tag/> are an SGML hack, or more precisely the combination of two features of SGML. In SGML, the forward slash closes the tag, and the following angle bracket is character data, not part of the tag.

The W3C validator uses a real SGML parser for HTML doctypes, and fails on XML-like /> constructs: http://validator.w3.org/check?uri=data%3Atext%2Fhtml%2C%3C!DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+HTML+4.01%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fhtml4%2Fstrict.dtd%22%3E+%3Chtml%3E+%3Chead%3E+++%3Ctitle%3ETest%3C%2Ftitle%3E+++%3Cmeta+name%3Dtest+content%3Done%2F%3E+++%3Cmeta+name%3Dbug+content%3Dtwo%3E+%3C%2Fhead%3E+%3Cbody%3E+++%3Cp%3ETest%3C%2Fp%3E+%3C%2Fbody%3E+%3C%2Fhtml%3E&charset=%28detect+automatically%29&doctype=Inline&group=1&verbose=1

The complete explanation can be read at http://www.cs.tut.fi/~jkorpela/html/empty.html

In conclusion, sgmllib is right. Use an XML parser for XML or an HTML5 parser for HTML.
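As a rough illustration of that advice, here is a minimal sketch using Python 3 standard-library module names (an assumption; this report predates them, and sgmllib itself is not used here). xml.etree.ElementTree treats <empty/> as an element with no content, and html.parser.HTMLParser, which is a lenient HTML parser rather than a full HTML5 one, reports XML-style empty tags through handle_startendtag. The class name EmptyTagCollector and the sample markup are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# XML input: an XML parser understands <empty/> natively.
root = ET.fromstring("<root><empty/><p>Test</p></root>")
child_tags = [child.tag for child in root]
empty_element = root.find("empty")


class EmptyTagCollector(HTMLParser):
    """Hypothetical helper: records tags written in the <tag/> form."""

    def __init__(self):
        super().__init__()
        self.self_closed = []

    def handle_startendtag(self, tag, attrs):
        # Called by HTMLParser for XML-style empty tags such as <br/>.
        self.self_closed.append(tag)


# HTML input: the lenient parser accepts the trailing slash.
collector = EmptyTagCollector()
collector.feed("<p>Test<br/></p>")
collector.close()

print(child_tags)             # tags of the children of <root>
print(empty_element.text)     # an empty element carries no text
print(collector.self_closed)  # tags seen in <tag/> form
```

Neither parser treats the closing angle bracket as character data, which is the behaviour a strict SGML parser is expected to show.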

Kind regards

----------
nosy: +Merwok

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue5498>
_______________________________________

