iterparse and unicode

Mon Aug 25 16:19:37 EDT 2008

On Aug 24, 1:12 am, Stefan Behnel <stefan... at behnel.de> wrote:
> George Sakkis wrote:
> > On Aug 21, 1:48 am, Fredrik Lundh <fred... at pythonware.com> wrote:
>
> >> George Sakkis wrote:
> >>> It's interesting that the element text attributes after a successful
> >>> parse do not necessarily have the same type, i.e. all be str or all
> >>> unicode. I ported some text extraction code from BeautifulSoup (which
> >>> handles all text as unicode) and I was surprised to find out that in
> >>> xml.etree the returned text's type is not fixed, even within the same
> >>> file. Although it's not a bug, having a mixed collection of byte and
> >>> unicode strings from the same source makes me somewhat uneasy.
> >> If you don't care about memory and execution performance, there are
> >> plenty of toolkits that guarantee that you always get Unicode strings.
>
> > As long as they are documented, both approaches are fine for different
> > cases. Currently the only reference I found about unicode in
> > ElementTree is "All strings can either be Unicode strings, or 8-bit
> > strings containing US-ASCII only." [1], which is rather ambiguous
>
> It's not ambiguous in Py2.x, where ASCII byte strings and unicode strings are
> compatible. No need to feel "uneasy". :)

It depends on what you mean by "compatible"; e.g. you can't safely do
[s.decode('utf8') for s in strings] if you have byte strings mixed
with unicode.
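
To make the failure mode concrete: in Py2.x, calling .decode('utf8') on a
unicode object first encodes it implicitly with the ASCII codec, so any
non-ASCII character raises UnicodeEncodeError. Below is a minimal sketch of
one way to normalize such a mixed list. It assumes the byte strings are
UTF-8 encoded, and the names to_text and strings are hypothetical, chosen
only for illustration. It is written so that it also runs on later Python
versions, where bytes plays the role of the Py2.x str type.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only when it is a byte string.

    Assumption: every byte string in the input uses `encoding`.
    Text (unicode) values are passed through untouched, which avoids
    the implicit ASCII encode step that breaks a blanket .decode() call.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A mixed collection like the one an ElementTree parse may hand back:
# a byte string, a unicode string with a non-ASCII character, and a
# UTF-8 encoded byte string (the last one only for illustration).
strings = [b"plain ascii", u"caf\u00e9", b"caf\xc3\xa9"]

# Every element is now text, whatever type it started as.
texts = [to_text(s) for s in strings]
```

The type check is the whole point: the decision to decode is made per
element instead of assuming the list is homogeneous.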

George