iterparse and unicode

George Sakkis george.sakkis at gmail.com
Thu Aug 21 07:34:33 EDT 2008


On Aug 21, 1:48 am, Fredrik Lundh <fred... at pythonware.com> wrote:

> George Sakkis wrote:
> > It's interesting that the element text attributes after a successful
> > parse do not necessarily have the same type, i.e. all be str or all
> > unicode. I ported some text extraction code from BeautifulSoup (which
> > handles all text as unicode) and I was surprised to find out that in
> > xml.etree the returned text's type is not fixed, even within the same
> > file. Although it's not a bug, having a mixed collection of byte and
> > unicode strings from the same source makes me somewhat uneasy.
>
> If you don't care about memory and execution performance, there are
> plenty of toolkits that guarantee that you always get Unicode strings.

As long as they are documented, both approaches are fine for different
cases. Currently the only reference I found about unicode in
ElementTree is "All strings can either be Unicode strings, or 8-bit
strings containing US-ASCII only." [1], which is rather ambiguous; at
least I read it as "all strings are Unicode or all strings are 8-bit
strings", not a potential mix of both in the same tree.
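
For anyone who wants to check this on their own files, here is a minimal
sketch that collects the types of the text attributes seen during an
iterparse run. The helper name text_types and the sample document are
made up for illustration. On the 2.x ElementTree discussed here I would
expect a document with both ASCII-only and non-ASCII text to report both
str and unicode; on Python 3 all text is str, so only one type shows up.

```python
import io
import xml.etree.ElementTree as ET


def text_types(xml_bytes):
    """Return the set of type names found in elem.text across a document.

    Hypothetical helper for illustration; not part of ElementTree.
    """
    seen = set()
    # iterparse yields ("end", element) pairs by default.
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes)):
        if elem.text is not None:
            seen.add(type(elem.text).__name__)
    return seen


# One element with ASCII-only text, one with a non-ASCII character.
sample = u'<root><a>plain ascii</a><b>caf\u00e9</b></root>'.encode('utf-8')

print(sorted(text_types(sample)))
```

If the result has more than one entry, the tree mixes byte and unicode
strings and any code that assumes a single type needs to normalize first.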

Regards,
George

[1] http://effbot.org/zone/element.htm


