the tostring and XML methods in ElementTree

Stefan Behnel stefan.behnel-n05pAM at web.de
Fri May 19 10:19:46 EDT 2006


George Sakkis wrote:
> Fredrik Lundh wrote:
> 
>> mirandacascade at yahoo.com wrote:
>>
>>> I wanted to see what would happen if one used the results of a tostring
>>> method as input into the XML method.  What I observed is this:
>>> a) beforeCtag.text is of type <type 'str'>
>>> b) beforeCtag.text when printed displays: I'm confused
>>> c) afterCtag.text is of type <type 'unicode'>
>>> d) afterCtag.text when printed displays: I?m confused
>> the XML file format isn't a Python string serialization format, it's an XML infoset
>> serialization format.
>>
>> as stated in the documentation, ET always uses Unicode strings for text that
>> contains non-ASCII characters.  for text that *only* contains ASCII, it may use
>> either Unicode strings or 8-bit strings, depending on the implementation.
>>
>> the behaviour if you're passing in non-ASCII text as 8-bit strings is undefined
>> (which means that you shouldn't do that; it's not portable).
> 
> I was about to post a similar question when I found this thread.
> Fredrik, can you explain why this is not portable ?

Because there is no such thing as a default encoding for 8-bit strings.
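
To make that concrete, here is a small sketch of how one and the same 8-bit string means different things under different encodings. It assumes a current Python 3 (where 8-bit strings are `bytes`), and it assumes that the apostrophe in the original poster's text was the Windows-1252 "smart quote" byte 0x92; that byte value is a guess for illustration, not something stated in the thread.

```python
# One 8-bit string, interpreted under three candidate encodings.
# The 0x92 byte is an assumed example value, not taken from the thread.
raw = b"I\x92m confused"

# Windows-1252 maps 0x92 to U+2019, a right single quotation mark.
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps 0x92 to U+0092, an invisible C1 control character.
as_latin1 = raw.decode("latin-1")

# In UTF-8 a lone 0x92 is a stray continuation byte, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_cp1252))
print(repr(as_latin1))
print("valid UTF-8:", utf8_ok)
```

Nothing in the bytes themselves says which of these readings was intended, so a library that receives them can only guess.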


> I'm currently using
> (a variation of) the workaround below instead of ET.tostring and it
> works fine for me:
> 
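> def tostring(element, encoding=None):
>     text = element.text
>     if text:
>         # Default to the original value so text2 is always bound,
>         # e.g. for unicode text or for str text without an encoding.
>         text2 = text
>         if not isinstance(text, basestring):
>             text2 = str(text)
>         elif isinstance(text, str) and encoding:
>             text2 = text.decode(encoding)
>         element.text = text2
>     s = ET.tostring(element, encoding)
>     # Restore the caller's original .text value.
>     element.text = text
>     return s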
> 
> 
> Why isn't this the standard behaviour ?


Because it wouldn't work. What if you wanted to serialize to an encoding
different from that of the strings you put into the .text fields? How is ET
supposed to know what encoding your strings have? And how should it know that
you didn't happily mix various different byte encodings in your strings?

Use unicode; that works *and* is portable.
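
A minimal sketch of that approach, assuming a current Python 3 where ElementTree is available as `xml.etree.ElementTree` and every `str` is a Unicode string. The helper `decode_at_boundary` and the cp1252 input bytes are hypothetical, for illustration only: the point is that the caller, who knows where the bytes came from, decodes them once, and the output encoding is then chosen independently at serialization time.

```python
import xml.etree.ElementTree as ET


def decode_at_boundary(raw, source_encoding):
    """Hypothetical helper: decode incoming bytes where their encoding is known."""
    return raw.decode(source_encoding)


elem = ET.Element("c")
# The caller states the input encoding explicitly; ET never has to guess.
elem.text = decode_at_boundary(b"I\x92m confused", "cp1252")

# The target encoding is independent of the encoding the input arrived in.
as_utf8 = ET.tostring(elem, encoding="utf-8")
as_latin1 = ET.tostring(elem, encoding="iso-8859-1")

# Both serializations parse back to the same Unicode text.
back_utf8 = ET.XML(as_utf8)
back_latin1 = ET.XML(as_latin1)

print(back_utf8.text == elem.text)
print(back_latin1.text == elem.text)
```

The two byte strings differ, but the text that comes back out of the parser is identical in both cases, which is what makes this portable.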

Stefan


