iterparse and unicode

Stefan Behnel stefan_ml at behnel.de
Sun Aug 24 01:12:01 EDT 2008


George Sakkis wrote:
> On Aug 21, 1:48 am, Fredrik Lundh <fred... at pythonware.com> wrote:
> 
>> George Sakkis wrote:
>>> It's interesting that the element text attributes after a successful
>>> parse do not necessarily have the same type, i.e. all be str or all
>>> unicode. I ported some text extraction code from BeautifulSoup (which
>>> handles all text as unicode) and I was surprised to find out that in
>>> xml.etree the returned text's type is not fixed, even within the same
>>> file. Although it's not a bug, having a mixed collection of byte and
>>> unicode strings from the same source makes me somewhat uneasy.
>> If you don't care about memory and execution performance, there are
>> plenty of toolkits that guarantee that you always get Unicode strings.
> 
> As long as they are documented, both approaches are fine for different
> cases. Currently the only reference I found about unicode in
> ElementTree is "All strings can either be Unicode strings, or 8-bit
> strings containing US-ASCII only." [1], which is rather ambiguous

It's not ambiguous in Py2.x, where ASCII-only byte strings and unicode strings
compare equal and can be mixed freely. No need to feel "uneasy". :)
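
If a uniform text type is wanted anyway, the following is a minimal sketch
(not from the original thread) of normalizing element text while iterating
with iterparse. The helper name to_text and the sample document are made up
for illustration. It assumes that any byte string returned as element text is
ASCII-only, as the quoted documentation states; on Python 3 the text is
already str, so the helper simply passes it through.

```python
import io
import xml.etree.ElementTree as ET


def to_text(value):
    """Return element text as a text (unicode) string.

    Hypothetical helper: decodes ASCII-only byte strings and leaves
    text strings untouched. A missing text (None) becomes "".
    """
    if value is None:
        return u""
    if isinstance(value, bytes):
        # Assumption: byte strings from ElementTree contain US-ASCII only.
        return value.decode("ascii")
    return value


# Sample document with one ASCII-only and one non-ASCII text node.
xml_data = u"<root><a>plain</a><b>caf\u00e9</b></root>".encode("utf-8")

# iterparse yields (event, element) pairs; the default event is "end".
texts = [to_text(elem.text) for _, elem in ET.iterparse(io.BytesIO(xml_data))]
```

After this, every item in texts has the same type, whichever type the parser
picked for each element.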

Stefan


