iterparse and unicode

Thu Aug 21 01:48:57 EDT 2008

George Sakkis wrote:

> Thank you both for the suggestions. I made a few more experiments to
> understand how iterparse behaves with respect to three dimensions:

Spending time researching undefined behaviour is pretty pointless. ET
parsers expect byte streams, because that's what XML files are. If you
pass them anything else, an ET implementation may attempt to convert that
thing to a byte string, run the game "rogue", or do something else that
it finds appropriate.
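
For illustration, here is a minimal sketch of the defined case: handing
iterparse a binary stream and letting the parser do the decoding based on
the XML declaration. The sample document and the helper name item_texts
are made up for this example, and the sketch assumes a current Python 3
interpreter, where parsed text always comes back as str.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, encoded to bytes to match its declaration.
xml_bytes = (
    u'<?xml version="1.0" encoding="utf-8"?>'
    u"<root><item>caf\u00e9</item><item>tea</item></root>"
).encode("utf-8")


def item_texts(source):
    """Collect the text of every <item> element from a binary stream."""
    texts = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            texts.append(elem.text)
            elem.clear()  # release the element's content once it is handled
    return texts


# The parser reads bytes and decodes them itself.
texts = item_texts(io.BytesIO(xml_bytes))
```

The point is only that the source is a byte stream (a file opened in
binary mode works the same way); nothing is decoded before parsing.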

> It's interesting that the element text attributes after a successful
> parse do not necessarily have the same type, i.e. all be str or all
> unicode. I ported some text extraction code from BeautifulSoup (which
> handles all text as unicode) and I was surprised to find out that in
> xml.etree the returned text's type is not fixed, even within the same
> file. Although it's not a bug, having a mixed collection of byte and
> unicode strings from the same source makes me somewhat uneasy.

If you don't care about memory and execution performance, there are 
plenty of toolkits that guarantee that you always get Unicode strings.
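
If a uniform type matters more than the saved memory, one option is to
normalize at the point of use. The following is only a sketch: the helper
name to_text is hypothetical, and it assumes that any byte string handed
back by the parser is pure ASCII, so decoding it as ASCII is lossless.

```python
def to_text(value, encoding="ascii"):
    """Return value as a text string, decoding byte strings if needed."""
    if value is None:
        # Elements without text report None; treat that as empty text here.
        return u""
    if isinstance(value, bytes):
        # Assumption: byte strings from the parser contain only ASCII.
        return value.decode(encoding)
    return value


# Hypothetical mixed collection of the kind described in the quote above.
mixed = [b"plain", u"caf\u00e9", None]
normalized = [to_text(v) for v in mixed]
```

This pays the conversion cost only for the strings you actually touch,
rather than for every text node in the tree.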

</F>