ElementTree, XML and Unicode -- C0 Controls

Mon Dec 11 10:24:43 EST 2006

Hi all,

The Unicode code points in the U+0000-U+001F range (the C0
controls) -- except tab (U+0009), newline (U+000A) and carriage
return (U+000D) -- are not legal XML 1.0 characters.
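
For illustration, here is a minimal sketch of a check for the rule
above. The helper name find_illegal_c0 is made up for this example,
and the pattern covers only the C0 range discussed here, not every
character that XML 1.0 excludes.

```python
import re

# C0 controls other than tab (0x09), newline (0x0A) and carriage return (0x0D).
# Assumption: only the 0000-001F range is checked, nothing above it.
_ILLEGAL_C0 = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_c0(text):
    """Return the index of the first illegal C0 control in text, or -1."""
    match = _ILLEGAL_C0.search(text)
    return match.start() if match else -1
```

It returns -1 for a string such as "a\tb\n" and the offset of the
offending character otherwise.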

A serialize/deserialize round trip of such a string with
ElementTree fails, but only at the parsing step:

>>> from xml.etree.ElementTree import Element, tostring, fromstring
>>> elt = Element("root", char=u"\u0000")
>>> xml = tostring(elt)
>>> xml
'<root char="\x00" />'
>>> fromstring(xml)
   [...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 12

Good! But I was expecting a failure *earlier*, in
the "tostring" function -- I basically assumed that
ElementTree would refuse to generate an XML
fragment that is not well-formed.
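
To make the expectation concrete, here is a rough sketch of a
serializer that fails early. The wrapper checked_tostring is
hypothetical and not part of ElementTree; it assumes a current Python
where the module lives at xml.etree.ElementTree, and it only rejects
the C0 controls mentioned above.

```python
import re
from xml.etree.ElementTree import Element, tostring

# C0 controls other than tab, newline and carriage return.
_ILLEGAL_C0 = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def checked_tostring(elt):
    """Hypothetical wrapper: serialize elt, refusing illegal C0 controls."""
    for node in elt.iter():
        # Gather every string this node contributes to the output.
        strings = [node.tag, node.text, node.tail]
        for name, value in node.attrib.items():
            strings.extend([name, value])
        for s in strings:
            # Skip None and non-string tags (e.g. comment nodes).
            if isinstance(s, str) and _ILLEGAL_C0.search(s):
                raise ValueError("illegal XML 1.0 character in %r" % s)
    return tostring(elt)
```

With this wrapper, checked_tostring(Element("root", char=u"\u0000"))
raises ValueError instead of returning a fragment that the parser
later rejects. The cost is one extra pass over the tree's strings.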

Could anyone comment on the rationale behind
the current behavior? Is it a performance issue,
with the scan for illegal code points being
too expensive?

Cheers,

SB