[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

Fri Mar 12 10:00:48 CET 2010

Fredrik Lundh <fredrik at effbot.org> added the comment:

"'None' has always been the documented default for the encoding parameter"

That's probably mostly by accident, at least in the original ET, but the 1.3 draft docs at effbot.org/elementtree do spell it out explicitly for the 'write' method:

   Output encoding. If omitted or set to None, defaults to US-ASCII.

Not sure I'd consider this text binding in itself, though (even if I'd argue that it's preferable to have the same interpretation of encoding everywhere).

"writing out the Unicode serialisation will result in an incorrect XML serialisation"

I think Guido meant the ElementTree.write method; is that broken too?

The file.write(et.tostring()) issue is probably my most pressing concern here; that's a common use case (e.g. when using "iterparse" to cut pieces from a big document), and the defaults were chosen to increase the chance that this automatically does the right thing for non-ASCII even if the programmer never tests it.  In 3.X, that construct is suddenly dependent on the interpreter's default encoding.
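
To make the concern concrete, here is a minimal sketch of the two behaviours. It assumes the tostring() semantics of current Python 3 releases (bytes by default, str only when encoding="unicode" is passed), which is not necessarily the Py3k behaviour being discussed in this report. The element name and the in-memory files are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative element containing a non-ASCII character.
elem = ET.Element("greeting")
elem.text = "caf\u00e9"

# Bytes serialisation: with the default encoding (US-ASCII), non-ASCII
# characters are written as character references, so the result is safe
# to write to any binary file without further thought.
as_bytes = ET.tostring(elem)
binary_file = io.BytesIO()
binary_file.write(as_bytes)

# Unicode serialisation: the result is a str, so the bytes that end up
# on disk depend on whatever encoding the text file was opened with.
as_text = ET.tostring(elem, encoding="unicode")

# Simulate a text file whose encoding cannot represent the character,
# as could happen when the platform default is ASCII.
ascii_text_file = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    ascii_text_file.write(as_text)
    ascii_text_file.flush()
    text_write_failed = False
except UnicodeEncodeError:
    text_write_failed = True
```

The bytes form carries its own encoding decision, while the str form defers it to the file object, which is where the dependence on the default encoding comes from.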

I think I'd prefer old "tostring" behaviour and a separate "tounicode" function, and I'm still not convinced that the latter is required for the XML use case (which implies that maybe it should live in lxml.html for the HTML case, even if it ends up calling the same internal implementation).

Or should that be "tobytes" and "tounicode" to eliminate all ambiguity?
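
A rough sketch of what such a pair could look like, written as thin wrappers over the tostring() found in current Python 3 releases. The names "tobytes" and "tounicode" are hypothetical here; they are not existing functions in xml.etree.ElementTree, and the argument handling is only one possible design.

```python
import xml.etree.ElementTree as ET


def tobytes(element, encoding="us-ascii"):
    """Hypothetical helper: always return an encoded byte string."""
    if encoding is None:
        # Mirror the documented default: None means US-ASCII.
        encoding = "us-ascii"
    if encoding.lower() == "unicode":
        raise ValueError("tobytes() needs a real byte encoding")
    return ET.tostring(element, encoding=encoding)


def tounicode(element):
    """Hypothetical helper: always return a str, with no encoding applied."""
    return ET.tostring(element, encoding="unicode")


# Illustrative usage.
elem = ET.Element("greeting")
elem.text = "caf\u00e9"
data = tobytes(elem)               # bytes, pure ASCII
data_utf8 = tobytes(elem, "utf-8")  # bytes, UTF-8 encoded
text = tounicode(elem)             # str
```

With this split, the return type follows from the function name alone rather than from the value of the encoding argument.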

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8047>
_______________________________________