Python parsing iTunes XML/COM

John Machin sjmachin at lexicon.net
Thu Jul 31 19:19:37 EDT 2008


On Aug 1, 7:44 am, william tanksley <wtanksle... at gmail.com> wrote:
> John Machin <sjmac... at lexicon.net> wrote:
> > william tanksley <wtanksle... at gmail.com> wrote:
> > Let's try again:
>
> Cool. Sorry for the misunderstanding. Thank you for helping again!
>
> Postscript: your request to print the actual data did the trick.

I'd back inspecting actual data against armchair philosophy any
time :-)

> I'm
> including the rest of my reply just to provide context, but the answer
> was that the Unicode was actually embedded in the URL, encoded as
> distinct bytes. Thus, it *had* to be url-decoded and then UTF-8
> decoded, in that order, in order to recover the original filename.
>
> So the problem was indeed purely in my head -- I should have looked at
> the original data (unfortunately, I was fooled by looking at the song
> title, which is the same thing but with the raw UTF-8 bytes instead of
> the URL escape codes).
>
>
>
> > >> track_id = url2pathname(urlparse(track_id).path)
> > >> print repr(track_id)
> > >> parse_result = urlparse(track_id).path
> > >> print repr(parse_result)
> > >> track_id_replacement = url2pathname(parse_result)
> > >> print repr(track_id_replacement)
> > > The "important" value here is track_id_replacement; it contains the
> > > data that's throwing me. It appears that some UTF-8 characters are
> > > being read as multiple bytes by ElementTree rather than being decoded
> > > into Unicode.
> > > Here's one example. The others are similar -- they have the same
> > > things that look like problems to me.
> > > "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"
> > ROTFL! I thought the Buffett thing was a Windows filename! What I was
> > expecting was THREE lots of repr() output, and I'm quite unused to
> > seeing repr() output with quotes around it instead of apostrophes; how
> > did you achieve that?
>
> I don't know -- but I got it again when I printed out the original
> version. My *guess* would be that this is what repr prints when asked
> to print a byte string (but I don't know how to confirm that).
> Alternately, the fact that I'm running these inside SPE might be
> changing some defaults. I'm not sure.
>
> You're right that single quotes are expected -- and I'd expect a
> preceding u, since they're supposed to be Unicode. I dunno what's
> going on.

Why do you suppose that the contents are Unicode? It's a URL-encoded
string i.e. *deliberately* ASCII, in fact sub-ASCII (see all the %20
stuff?). What's going on is that ElementTree presents text as ASCII if
it can be so represented, otherwise as Unicode. This is actually a
*convenience*. Get used to it. Enjoy it.
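
To make the point about the URL being plain ASCII concrete, here is a
minimal sketch of the two-step recovery described earlier in the thread
(url-decode first, then UTF-8 decode). It is written for Python 3's
urllib.parse rather than the Python 2 urlparse/url2pathname calls quoted
here, and the shortened URL is a made-up stand-in for the real one, so
treat it as an illustration rather than the code anyone actually ran.

```python
from urllib.parse import unquote_to_bytes, urlparse

# Hypothetical, shortened URL modelled on the one quoted in this thread.
url = ("file://localhost/C:/Music/"
       "Buffett%20Time%20-%20Annual%20Shareholders%C2%A0L.mp3")

# Step 0: pull out the path component; it is still pure ASCII.
path = urlparse(url).path
print(repr(path))

# Step 1: undo the %XX escapes, which yields raw bytes (here UTF-8).
raw = unquote_to_bytes(path)
print(repr(raw))

# Step 2: decode those bytes as UTF-8 to get the real filename text.
name = raw.decode("utf-8")
print(repr(name))
```

The %C2%A0 pair becomes the two bytes C2 A0 after step 1, and only
step 2 turns them into the single character U+00A0 (no-break space).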

>
> > So you're saying that track_id_replacement contains utf8 characters.
> > It is obtained by track_id_replacement = url2pathname(parse_result).
> > You don't show us what is in parse_result. url2pathname() is nothing
> > to do with ElementTree. urlparse() is nothing to do with ElementTree.
> > You have provided no evidence that ElementTree is doing what you
> > accuse it of.
>
> Okay. Here's the evidence... Or something. Looking at this I begin to
> see why things work the way they do. It's utterly bizarre, quite
> frankly.
>
> > Please try again. Backtrack in your code to where you are pulling the
> > url out of an element. Do print repr(some_element.some_attribute).
> > Show us.
>
> Okay, the repr of the string that comes out of the .text attribute is:
>
> "file://localhost/C:/Documents%20and%20Settings/TanksleyJrW/My
> %20Documents/My%20Music/iTunes/iTunes%20Music/Podcasts/Brian
> %20Preston's%20_Money%20Guy_%20Blog%20and%20Pod/Buffett%20Time%20-
> %20Annual%20Shareholders%C2%A0L.mp3"
>
> Looking at the XML, and THIS TIME actually looking at the correct
> attribute (I was looking at the title before) I see... surprise!
> That's the correct data.
>
> So all of the mysteries are solved (except for my Python's
> doublequotes, but who cares), and ElementTree is entirely vindicated.

Shucks. I can sense that you'd been looking forward to conducting an
auto-da-fe followed by tossing the author on a bonfire ... but you
can't burn a bot anyway :-)
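
On the double-quote puzzle: one likely explanation, assuming the full
string contained the apostrophe in "Preston's" as the quoted URL does,
is that repr() switches to double quotes whenever a string contains an
apostrophe and no double quote. A small sketch (Python 3, with made-up
sample strings):

```python
# Sample strings are illustrative, not taken from the actual library file.
with_apostrophe = "Brian Preston's Blog"
plain = "Buffett Time"

# repr() prefers apostrophes, but uses double quotes when the string
# itself contains an apostrophe and no double quote.
print(repr(with_apostrophe))
print(repr(plain))
```

If that is what happened, neither SPE nor the string type had anything
to do with it.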


