Python parsing iTunes XML/COM

Thu Jul 31 09:44:57 EDT 2008

Stefan Behnel <stefan... at behnel.de> wrote:
> william tanksley wrote:
> > Okay, my answer is that ElementTree (in Python 2.5) is simply
> > deranged when it comes to Unicode. It assumes everything's ASCII.

> It does not "assume" that. It *requires* byte strings to be ASCII.

You can't encode Unicode into an ASCII string. (Well, except using
UTF-7.) Bad requirement.
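
To make the UTF-7 aside concrete, here is a minimal sketch. It assumes a current Python 3 interpreter rather than the 2.5 under discussion, and the sample title is made up. It shows the ASCII codec refusing a non-ASCII character, while UTF-7 yields bytes that all fit in 7 bits.

```python
# Assumption: Python 3; the sample title is illustrative only.
title = "Caf\u00e9 del Mar"


def try_ascii(text):
    """Return the ASCII-encoded bytes, or None if the text does not fit."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


ascii_bytes = try_ascii(title)          # None: U+00E9 is outside ASCII
utf7_bytes = title.encode("utf-7")      # every byte is below 128
round_trip = utf7_bytes.decode("utf-7")  # decodes back to the original text
```

The UTF-7 bytes are technically ASCII-safe, but they are not readable as the original text without decoding, which is why this is only an escape hatch.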

> If it
> didn't enforce that, how could it possibly know what encoding they were using,
> i.e. what they were supposed to mean at all? Read the Python Zen, in the face
> of ambiguity, ElementTree refuses the temptation to guess. Python 2.x does
> exactly the same thing when it comes to implicit conversion between encoded
> strings and Unicode strings.

An XML file that begins with the declaration
<?xml version="1.0" encoding="utf-8"?> is NOT ASCII. You don't have to
guess what encoding it's in. It's UTF-8. If you error out when you hit
an 8-bit character, you're not going to be able to process that file.
I'm completely lost on why you're claiming otherwise.
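
As a sketch of the point about declared encodings: the snippet below assumes a current Python 3 standard library and uses made-up fragments, not a real iTunes library file. It parses raw bytes from a binary file-like object and lets the parser pick the encoding from the XML declaration. Behaviour under Python 2.5 may differ from what is shown here.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: Python 3; both documents are invented for illustration.
utf8_doc = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<dict><key>Name</key><string>Caf\u00e9 del Mar</string></dict>'
).encode("utf-8")

# Parse from a binary stream, as one would with open(path, "rb").
tree = ET.parse(io.BytesIO(utf8_doc))
utf8_name = tree.getroot().find("string").text

# Same idea with a different declared encoding: the declaration, not the
# caller, tells the parser how to decode the bytes.
latin1_doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    '<string>Caf\u00e9 del Mar</string>'
).encode("iso-8859-1")
latin1_name = ET.fromstring(latin1_doc).text
```

Both names come back as decoded text with no codec named anywhere in the calling code.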

Furthermore, when ElementTree returns (from an element's .text attribute)
a string-of-bytes instead of a decoded Unicode string, it doesn't
merely "resist the temptation to guess"; instead, it forces ME to
guess. I've now had to hardcode "utf-8" into my program, when IT just
bypassed and ignored an explicit instruction to use UTF-8. I hope and
assume that iTunes will never switch from UTF-8 to UTF-32 -- if it
does, my code breaks, and I'll probably have to switch away from
ElementTree (I guess that since it requires ASCII it won't even
pretend to handle more than 8 bits per character).

> If you want to pass plain ASCII strings, you can either pass a byte string or
> a Unicode string (that's a plain convenience feature). If you want to pass
> anything that's not ASCII, you *must* pass a Unicode string.

I don't care about strings. I've never passed ElementTree a string.
I'm using a file, a file that's correctly encoded as UTF-8, and it
returns some text elements that are raw bytes (undecoded). I have to
manually decode them.
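
A minimal sketch of that manual decoding step, assuming values may arrive either as byte strings or as already-decoded text. The helper name `to_text` and its hardcoded "utf-8" default are hypothetical and are not part of ElementTree.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only when it is a byte string.

    Hypothetical helper: the default encoding is a hardcoded guess, which
    is exactly the fragility described above.
    """
    if value is None:
        return ""                      # elements with no text yield None
    if isinstance(value, bytes):
        return value.decode(encoding)  # raw bytes: decode explicitly
    return value                       # already text: pass through


raw = "Caf\u00e9 del Mar".encode("utf-8")
decoded = to_text(raw)
unchanged = to_text("Caf\u00e9 del Mar")
empty = to_text(None)
```

If the file ever switched to another encoding, the default here would have to change with it.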

> > Reference: http://codespeak.net/lxml/compatibility.html

> > (Note that the lxml version also doesn't handle Unicode correctly; it
> > errors when XML declares its encoding.)

> It definitely does "handle Unicode correctly".

Actually, this is my bad -- I misread the webpage. lxml appears to
handle Unicode strings with a declared encoding correctly: it errors
out. That's quite reasonable when confronted with a contradiction.
According to that page, however, the standard ElementTree library
doesn't work that way -- it simply assumes that byte strings are
ASCII.
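
One way to sidestep that contradiction, sketched with the Python 3 standard library on a made-up document: if XML is held as decoded text but still carries a byte-encoding declaration, encode it back to the declared encoding before parsing. The parser then sees bytes that agree with the declaration.

```python
import xml.etree.ElementTree as ET

# Assumption: Python 3; the document is invented for illustration.
# This is decoded text that still announces a byte encoding.
xml_text = '<?xml version="1.0" encoding="utf-8"?><name>Caf\u00e9</name>'

# Encode to the declared encoding so declaration and data agree, then
# hand the parser bytes instead of text.
root = ET.fromstring(xml_text.encode("utf-8"))
name = root.text
```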

I'm going to back down on this one, though. I realize that this is a
single paragraph on a third-party website, and it's not really trying
to document the official ElementTree (it's trying to document its own
version, lxml). So it might not be correct, or it might be overly
ambiguous. It might also be talking ONLY about strings, to the
exclusion of file input. I don't know, and I don't have the energy to
debug it, especially since I can't "fix" anything about it even if
something were wrong :-).

So I revert to my former position: I don't know why those two lines
have to be in that order for my code to work correctly; I don't even
know why the "encode" line has to be there at all. When I was using
the old Python XML library, I didn't have to worry about encoding or
decoding; everything just worked. I really prefer ElementTree, and I'm
glad I upgraded, but it really looks like encoding is a problem.

> Let me guess, you tried passing
> XML as a Unicode string into the parser, and your XML declared itself as
> having a byte encoding (<?xml encoding="..."?>). How can that *not* be an error?

I thought you just said "resist the temptation to guess"? I didn't
pass a string. I passed a file. It didn't error out; instead, it
produced bytestring-encoded output (not Unicode).

> Stefan

-Wm