Python parsing iTunes XML/COM

Thu Jul 31 17:44:51 EDT 2008

John Machin <sjmac... at lexicon.net> wrote:
> william tanksley <wtanksle... at gmail.com> wrote:
> Let's try again:

Cool. Sorry for the misunderstanding. Thank you for helping again!

Postscript: your request to print the actual data did the trick. I'm
including the rest of my reply just to provide context, but the answer
was that the Unicode was actually embedded in the URL, percent-encoded
as separate UTF-8 bytes. Thus, it *had* to be url-decoded and then
UTF-8 decoded, in that order, to recover the original filename.
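
For anyone landing here later, a minimal sketch of that two-step order.
It is written for Python 3 (the code in this thread is Python 2), and it
uses urllib.parse.unquote_to_bytes as a stand-in for the url-decoding
step; the variable names are only illustrative.

```python
from urllib.parse import unquote_to_bytes

# Tail of the escaped filename as it appears in the URL.
escaped = "Buffett%20Time%20-%20Annual%20Shareholders%C2%A0L.mp3"

# Step 1: url-decode. Each %XX escape becomes one raw byte, so the
# no-break space is still two bytes (0xC2 0xA0) at this point.
raw = unquote_to_bytes(escaped)

# Step 2: UTF-8 decode. The two bytes collapse into one character, U+00A0.
name = raw.decode("utf-8")

print(repr(raw))
print(repr(name))
```

Decoding in the other order cannot work, because before step 1 there
are no UTF-8 bytes to decode, only ASCII percent escapes.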

So the problem was indeed purely in my head -- I should have looked at
the original data (unfortunately, I was fooled by looking at the song
title, which is the same thing but with the raw UTF-8 bytes instead of
the URL escape codes).

> >> track_id = url2pathname(urlparse(track_id).path)
> >> print repr(track_id)
> >> parse_result = urlparse(track_id).path
> >> print repr(parse_result)
> >> track_id_replacement = url2pathname(parse_result)
> >> print repr(track_id_replacement)
> > The "important" value here is track_id_replacement; it contains the
> > data that's throwing me. It appears that some UTF-8 characters are
> > being read as multiple bytes by ElementTree rather than being decoded
> > into Unicode.
> > Here's one example. The others are similar -- they have the same
> > things that look like problems to me.
> > "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"

> ROTFL! I thought the Buffett thing was a Windows filename! What I was
> expecting was THREE lots of repr() output, and I'm quite unused to
> seeing repr() output with quotes around it instead of apostrophes; how
> did you achieve that?

I don't know -- but I got it again when I printed out the original
version. My *guess* would be that this is what repr prints when asked
to print a byte string (but I don't know how to confirm that).
Alternatively, the fact that I'm running these inside SPE might be
changing some defaults. I'm not sure.

You're right that single quotes are expected -- and I'd expect a
preceding u, since they're supposed to be Unicode. I dunno what's
going on.
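
One possible explanation, offered as a guess rather than something
confirmed in this thread: repr() switches to double quotes when the
string contains an apostrophe and no double quote, and the full file
location does contain one (in "Preston's"). The missing u prefix would
then just mean the value is a byte string. A small Python 3 sketch of
the quoting rule, with made-up sample strings:

```python
# Sample strings are illustrative only.
plain = "Annual Shareholders.mp3"
with_apostrophe = "Brian Preston's _Money Guy_ Blog and Pod"

# No apostrophe in the value: repr() wraps it in single quotes.
print(repr(plain))

# Apostrophe but no double quote: repr() wraps it in double quotes
# so that nothing needs escaping.
print(repr(with_apostrophe))
```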

> So you're saying that track_id_replacement contains utf8 characters.
> It is obtained by track_id_replacement = url2pathname(parse_result).
> You don't show us what is in parse_result. url2pathname() is nothing
> to do with ElementTree. urlparse() is nothing to do with ElementTree.
> You have provided no evidence that ElementTree is doing what you
> accuse it of.

Okay. Here's the evidence... Or something. Looking at this I begin to
see why things work the way they do. It's utterly bizarre, quite
frankly.

> Please try again. Backtrack in your code to where you are pulling the
> url out of an element. Do print repr(some_element.some_attribute).
> Show us.

Okay, the repr of the string that comes out of the .text attribute is:

"file://localhost/C:/Documents%20and%20Settings/TanksleyJrW/My
%20Documents/My%20Music/iTunes/iTunes%20Music/Podcasts/Brian
%20Preston's%20_Money%20Guy_%20Blog%20and%20Pod/Buffett%20Time%20-
%20Annual%20Shareholders%C2%A0L.mp3"

Looking at the XML, and THIS TIME actually looking at the correct
attribute (I was looking at the title before) I see... surprise!
That's the correct data.
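
The whole chain from that string to a usable path can be sketched as
follows. This is Python 3, not the Python 2 code discussed here; it uses
urllib.parse.unquote instead of url2pathname so the result does not
depend on the platform; and it assumes the line breaks in the URL above
are only mail wrapping.

```python
from urllib.parse import urlparse, unquote

# The string from the .text attribute, rejoined into one line
# (assumption: the breaks in the mail are only line wrapping).
raw_url = (
    "file://localhost/C:/Documents%20and%20Settings/TanksleyJrW/"
    "My%20Documents/My%20Music/iTunes/iTunes%20Music/Podcasts/"
    "Brian%20Preston's%20_Money%20Guy_%20Blog%20and%20Pod/"
    "Buffett%20Time%20-%20Annual%20Shareholders%C2%A0L.mp3"
)

parts = urlparse(raw_url)

# unquote() replaces the %XX escapes and decodes the resulting bytes as
# UTF-8 (its default encoding), so %C2%A0 becomes the character U+00A0.
path = unquote(parts.path)

print(repr(parts.netloc))
print(repr(path))
```

The decoded path still starts with a slash before the drive letter;
turning it into a native Windows path is the part url2pathname handles.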

So all of the mysteries are solved (except for my Python's
doublequotes, but who cares), and ElementTree is entirely vindicated.

-Wm