the tostring and XML methods in ElementTree

Fri May 19 18:21:41 EDT 2006

George Sakkis wrote:
> > > I'm currently using
> > > (a variation of) the workaround below instead of ET.tostring and it
> > > works fine for me:
> > >
> > > def tostring(element, encoding=None):
> > >     text = element.text
> > >     if text:
> > >         text2 = text  # default: unicode text passes through unchanged
> > >         if not isinstance(text, basestring):
> > >             text2 = str(text)
> > >         elif isinstance(text, str) and encoding:
> > >             text2 = text.decode(encoding)
> > >         element.text = text2
> > >     s = ET.tostring(element, encoding)
> > >     element.text = text
> > >     return s
> > >
> > >
> > > Why isn't this the standard behaviour ?
> >
> >
> > > Because it wouldn't work. What if you wanted to serialize to a different
> > > encoding than that of the strings you put into the .text fields? How is ET supposed to
> > know what encoding your strings have? And how should it know that you didn't
> > happily mix various different byte encodings in your strings?
>
> If you're mixing different encodings, no tool can help you clean up the
> mess, you're on your own. This is very different though from having
> nice utf-8 strings everywhere, asking ET.tostring explicitly to print
> them in utf-8 and getting back garbage. Isn't the most reasonable
> assumption that the input's encoding is the same as the output's, or
> does this fall under the "refuse the temptation to guess" motto? If
> this is the case, ET could at least accept an optional input encoding
> parameter and convert everything to unicode internally.

This is an optimization: basically, you're delaying decoding. First of
all, have you measured the impact on your program if you delay decoding?
I'm sure for many programs it doesn't matter, so what you're proposing
would just pollute their source code with an optimization they don't need.
That doesn't mean it's a bad idea in general. I'd prefer it implemented
in the Python core with minimal impact on such programs, with decoding
delayed until you try to access individual characters. The code below can
be implemented without actual decoding:

utf8_text_file.write("abc".decode("utf-8") + " def".decode("utf-8"))

But this example requires decoding when the split method is called:

a = ("abc".decode("utf-8") + " def".decode("utf-8")).split()
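
To make the idea concrete, here is a rough sketch of what delayed decoding
could look like. It is only an illustration, not how the Python core or
ElementTree behaves. It uses Python 3 names (bytes/str rather than
str/unicode), the LazyText class and its "decoded" flag are hypothetical,
and it assumes a stateless, BOM-free encoding such as UTF-8 so that
concatenating the raw bytes is safe:

```python
import codecs


class LazyText:
    """Hypothetical wrapper: holds encoded bytes and decodes only on demand."""

    def __init__(self, data, encoding="utf-8"):
        self._data = data          # raw bytes, not decoded yet
        self._encoding = encoding
        self.decoded = False       # set to True once decoding was forced

    def __add__(self, other):
        # Joining two values in the same encoding needs no decoding.
        if not isinstance(other, LazyText) or other._encoding != self._encoding:
            return NotImplemented
        return LazyText(self._data + other._data, self._encoding)

    def encode(self, encoding):
        # Writing out in the encoding we already hold reuses the bytes as-is.
        if codecs.lookup(encoding).name == codecs.lookup(self._encoding).name:
            return self._data
        return self._text().encode(encoding)

    def _text(self):
        # The only place where actual decoding happens.
        self.decoded = True
        return self._data.decode(self._encoding)

    def split(self, sep=None):
        # Splitting works on characters, so decoding cannot be avoided here.
        return self._text().split(sep)


joined = LazyText(b"abc") + LazyText(b" def")
out = joined.encode("utf-8")   # no decoding needed
words = joined.split()         # forces decoding
```

Note the price of skipping the decode step: invalid byte sequences go
unnoticed until something finally forces decoding.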

> > Use unicode, that works *and* is portable.
>
> *and* it's not supported by all the 3rd party packages, databases,
> middleware, etc. you have to or want to use.

You can always call the .encode method. Granted, that could be a waste of
CPU and memory, but it works.
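
A small sketch of what I mean, written with Python 3 names (str for text,
bytes for encoded data). The legacy_store function is made up; it stands
in for any third-party API that only accepts byte strings. The text stays
Unicode inside the tree and is encoded only at the boundary:

```python
import xml.etree.ElementTree as ET


def legacy_store(payload):
    """Hypothetical stand-in for a third-party API that only takes bytes."""
    if not isinstance(payload, bytes):
        raise TypeError("bytes required")
    return len(payload)


elem = ET.Element("name")
# Text is kept as Unicode inside the tree.
elem.text = "Sakkis \u0393\u03b9\u03ce\u03c1\u03b3\u03bf\u03c2"

# Serialization picks the output encoding explicitly.
xml_bytes = ET.tostring(elem, encoding="utf-8")

# Encode only at the point where the byte-only API is called.
stored = legacy_store(elem.text.encode("utf-8"))
```

The extra .encode call costs a copy of the data, but the encoding is named
explicitly at each boundary instead of being guessed.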