the tostring and XML methods in ElementTree

Fri May 19 11:27:13 EDT 2006

Stefan Behnel wrote:

> George Sakkis wrote:
> > Fredrik Lundh wrote:
> >
> >> mirandacascade at yahoo.com wrote:
> >>
> >>> I wanted to see what would happen if one used the results of a tostring
> >>> method as input into the XML method.  What I observed is this:
> >>> a) beforeCtag.text is of type <type 'str'>
> >>> b) beforeCtag.text when printed displays: I'm confused
> >>> c) afterCtag.text is of type <type 'unicode'>
> >>> d) afterCtag.text when printed displays: I?m confused
> >> the XML file format isn't a Python string serialization format, it's an XML infoset
> >> serialization format.
> >>
> >> as stated in the documentation, ET always uses Unicode strings for text that
> >> contains non-ASCII characters.  for text that *only* contains ASCII, it may use
> >> either Unicode strings or 8-bit strings, depending on the implementation.
> >>
> >> the behaviour if you're passing in non-ASCII text as 8-bit strings is undefined
> >> (which means that you shouldn't do that; it's not portable).
> >
> > I was about to post a similar question when I found this thread.
> > Fredrik, can you explain why this is not portable?
>
> Because there is no such thing as a default encoding for 8-bit strings.
>
>
> > I'm currently using
> > (a variation of) the workaround below instead of ET.tostring and it
> > works fine for me:
> >
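> > def tostring(element, encoding=None):
> >     # Temporarily coerce element.text to unicode, serialize, then restore.
> >     text = element.text
> >     text2 = text  # default: leave unicode (or undecodable) text untouched
> >     if text:
> >         if not isinstance(text, basestring):
> >             text2 = str(text)
> >         elif isinstance(text, str) and encoding:
> >             text2 = text.decode(encoding)
> >         element.text = text2
> >     try:
> >         s = ET.tostring(element, encoding)
> >     finally:
> >         element.text = text  # always restore the caller's original value
> >     return s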
> >
> >
> > Why isn't this the standard behaviour?
>
>
> Because it wouldn't work. What if you wanted to serialize to a different encoding
> than that of the strings you put into the .text fields? How is ET supposed to
> know what encoding your strings have? And how should it know that you didn't
> happily mix various different byte encodings in your strings?

If you're mixing different encodings, no tool can help you clean up the
mess; you're on your own. This is very different, though, from having
nice utf-8 strings everywhere, asking ET.tostring explicitly to print
them in utf-8 and getting back garbage. Isn't the most reasonable
assumption that the input's encoding is the same as the output's, or
does this fall under the "refuse the temptation to guess" motto? If
this is the case, ET could at least accept an optional input encoding
parameter and convert everything to unicode internally.
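
A minimal sketch of what such an input-encoding parameter could look
like, written against current Python 3 (where 8-bit strings are bytes
objects) and the standard-library xml.etree.ElementTree. The helper name
tostring_with_input_encoding and its parameters are hypothetical; nothing
like it is part of ElementTree. Unlike the workaround quoted above, it
decodes the tree in place rather than restoring the original values.

```python
import xml.etree.ElementTree as ET


def tostring_with_input_encoding(element, input_encoding, output_encoding="utf-8"):
    """Hypothetical helper: decode any byte strings in the tree, then serialize."""

    def to_text(value):
        # Only byte strings need decoding; None and str pass through unchanged.
        if isinstance(value, bytes):
            return value.decode(input_encoding)
        return value

    for node in element.iter():
        node.text = to_text(node.text)
        node.tail = to_text(node.tail)
        for key, value in list(node.attrib.items()):
            node.attrib[key] = to_text(value)
    return ET.tostring(element, encoding=output_encoding)


# Usage: the text is stored as UTF-8 bytes and contains a non-ASCII apostrophe.
root = ET.Element("a")
child = ET.SubElement(root, "c")
child.text = "I\u2019m confused".encode("utf-8")

serialized = tostring_with_input_encoding(root, "utf-8")
roundtrip = ET.XML(serialized)
```

The caller still has to state the input encoding explicitly, so nothing
is guessed; the helper only moves the decoding step into one place.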

> Use unicode, that works *and* is portable.

*and* it's not supported by all the 3rd party packages, databases,
middleware, etc. you have to or want to use.
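
One way to live with such packages is to decode at the boundary. The
sketch below assumes current Python 3 and the standard-library
xml.etree.ElementTree; raw_from_database is a made-up stand-in for bytes
handed back by some 8-bit-only library, and Latin-1 is an assumed
encoding chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a value returned by a library that only speaks bytes.
raw_from_database = "caf\u00e9".encode("latin-1")

elem = ET.Element("name")
# Decode once, on the way in, using the encoding the source is known to use.
elem.text = raw_from_database.decode("latin-1")

# The output encoding is now an independent choice.
utf8_xml = ET.tostring(elem, encoding="utf-8")
latin1_xml = ET.tostring(elem, encoding="iso-8859-1")
```

Both serializations parse back to the same text, although the bytes on
the wire differ.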

> Stefan

George