[Pythonmac-SIG] XML Parsing

Steve Majewski sdm7g at mac.com
Sat Feb 7 08:18:21 CET 2009


It's not clear from your question whether your goal is to learn to
parse XML in Python or to solve a particular problem. If your goal is
to learn Python XML processing, then go right ahead. However, it looks
like you are using SAX below, and the sort of thing you describe might
be done better using a DOM parser (or maybe etree).

If what you want is not just to select some info from the XML file,
but to get it into a Python object so that you can then manipulate it
further, then DOM or etree is also probably a better model. It will
parse the XML (likely using an event-driven parser such as expat
underneath, much as SAX does) and give you an object that encodes the
whole file.
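
As a rough sketch of the etree approach (not run against the live
feed): the sample document below is my assumption about the feed's
shape, based on the element names used in the stylesheet further down,
and load_albums is just a name made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample, shaped like topalbums.xml as the stylesheet in this
# message expects it: album/@rank, name, playcount, artist/name.
SAMPLE = """\
<topalbums>
  <album rank="1">
    <name>Vheissu</name>
    <playcount>332</playcount>
    <artist><name>Thrice</name></artist>
  </album>
  <album rank="2">
    <name>The Artist in the Ambulance</name>
    <playcount>289</playcount>
    <artist><name>Thrice</name></artist>
  </album>
</topalbums>
"""


def load_albums(xml_text):
    """Return one dict per <album>, keeping album and artist names apart."""
    root = ET.fromstring(xml_text)  # root is the <topalbums> element
    albums = []
    for album in root.findall("album"):
        albums.append({
            "rank": int(album.get("rank")),
            "album": album.findtext("name"),          # <name> directly under <album>
            "artist": album.findtext("artist/name"),  # <name> inside <artist>
            "playcount": int(album.findtext("playcount")),
        })
    return albums


albums = load_albums(SAMPLE)
for a in albums:
    print("%s by %s (%d)" % (a["album"], a["artist"], a["playcount"]))
```

Because each lookup is relative to the <album> element, the two <name>
tags never get confused, and the result is an ordinary list of dicts
that you can sort or filter further.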

[ Not that it can't be done in SAX. It's just that, as you discovered,
   low-level SAX parsing requires that you keep track of the
   containment hierarchy yourself, which is a lot of work to solve a
   simple problem. ]
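
If you do want to stay with SAX, one way to track the containment
hierarchy is to keep a stack of the currently open element names and
look at the parent/child pair when an element closes. This is only a
sketch: the sample document is my assumption about the feed's shape,
and TopAlbumsHandler is a made-up name, not a fix to your class.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Assumed sample, shaped like the feed (album/@rank, name, playcount,
# artist/name); not copied from the real file.
SAMPLE = """\
<topalbums>
  <album rank="1">
    <name>Vheissu</name>
    <playcount>332</playcount>
    <artist><name>Thrice</name></artist>
  </album>
  <album rank="2">
    <name>The Artist in the Ambulance</name>
    <playcount>289</playcount>
    <artist><name>Thrice</name></artist>
  </album>
</topalbums>
"""


class TopAlbumsHandler(ContentHandler):
    """Collect album, artist and playcount by tracking the open-element path."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.path = []       # stack of currently open element names
        self.text = []       # character chunks seen since the last tag
        self.current = None  # dict for the <album> being read
        self.albums = []

    def startElement(self, tag_name, attrs):
        self.path.append(tag_name)
        self.text = []
        if tag_name == "album":
            self.current = {"rank": int(attrs.get("rank", 0))}

    def characters(self, content):
        self.text.append(content)

    def endElement(self, tag_name):
        content = "".join(self.text).strip()
        where = "/".join(self.path[-2:])  # parent/child, e.g. "album/name"
        if where == "album/name":
            self.current["album"] = content
        elif where == "artist/name":
            self.current["artist"] = content
        elif where == "album/playcount":
            self.current["playcount"] = int(content)
        elif tag_name == "album":
            self.albums.append(self.current)
        self.text = []
        self.path.pop()


handler = TopAlbumsHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
for a in handler.albums:
    print("%s by %s (%d)" % (a["album"], a["artist"], a["playcount"]))
```

The "album/name" versus "artist/name" test is the SAX equivalent of the
"if tag_name == album name tag ... elif artist name tag" check you
asked about.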


If you're just trying to work with XML, then most folks don't write
XML parsers for that sort of thing; they use higher-level tools:
XSLT, XPath, and/or XQuery.

The Mac has xsltproc as a built-in XSLT 1.0 processor.
There is an xpath program written in Perl in Leopard/10.5
(/usr/bin/xpath).
And Saxon is easily downloaded and does XSLT 2.0 and XQuery 1.0.


The following XSLT 1.0 stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>

<xsl:template match="/">
     <xsl:apply-templates select="/topalbums/album[@rank &lt; 6]"/>
     <!-- just select the top 5 albums -->
</xsl:template>

<xsl:template match="/topalbums/album" >
   album: <xsl:value-of select="name"/>
   artist: <xsl:value-of select="artist/name"/>
   count=<xsl:value-of select="playcount"/>
   <xsl:text>
   </xsl:text> <!-- this is here to insert the blank line break -->
</xsl:template>

</xsl:stylesheet>


Will, when run on that file, produce this output:
~$ xsltproc Untitled1.xsl  topalbums.xml

   album: Vheissu
   artist: Thrice
   count=332

   album: The Artist in the Ambulance
   artist: Thrice
   count=289

   album: Appeal To Reason
   artist: Rise Against
   count=286

   album: Favourite Worst Nightmare
   artist: Arctic Monkeys
   count=210

   album: The Sufferer & The Witness
   artist: Rise Against
   count=206

[ Not sure if that's anything like what you want. ]


I'm sure that the whole thing would reduce to an even more concise  
XQuery request.

I was trying to do the whole thing as an XPath one-liner, but it
didn't like my attempts to include alternates in parentheses. I think
this is an XPath 1.0 vs. XPath 2.0 issue: a parenthesized union used
as a path step is 2.0 syntax. As far as I know, Saxon is the only one
of these that supports 2.0. The Perl, Python and Java libraries only
support XPath 1.0.

This sort of expression did work using XPath 2.0 (in the oXygen editor):

	//album[@rank < 6]/(name|playcount|artist/name)

But I couldn't figure out a 1.0 syntax that would grab all three fields.

( And the Perl xpath seems to have a bug that interprets '@rank < 6'
as less-than-or-equal! )
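
For what it's worth, the etree module in the Python standard library
implements only a small subset of XPath (no '<' comparisons and no '|'
unions), so there the rank test has to be done in Python. A minimal
sketch, assuming a feed shaped like the sample below: the rank 6 album
is a made-up placeholder, and top_fields is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed sample document; the rank 6 entry is a placeholder that only
# exists to show the filter dropping it.
SAMPLE = """\
<topalbums>
  <album rank="1">
    <name>Vheissu</name>
    <playcount>332</playcount>
    <artist><name>Thrice</name></artist>
  </album>
  <album rank="2">
    <name>The Artist in the Ambulance</name>
    <playcount>289</playcount>
    <artist><name>Thrice</name></artist>
  </album>
  <album rank="6">
    <name>Placeholder Album</name>
    <playcount>1</playcount>
    <artist><name>Placeholder Artist</name></artist>
  </album>
</topalbums>
"""


def top_fields(xml_text, below=6):
    """Roughly emulate //album[@rank < 6]/(name|playcount|artist/name)."""
    root = ET.fromstring(xml_text)
    fields = []
    for album in root.iter("album"):
        # etree paths cannot express @rank < 6, so compare in Python.
        if int(album.get("rank")) < below:
            for path in ("name", "playcount", "artist/name"):
                fields.append(album.findtext(path))
    return fields


print(top_fields(SAMPLE))
```

Note that this returns the three fields in the order listed in the
tuple, whereas a real XPath union returns nodes in document order.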


-- Steve Majewski



On Feb 6, 2009, at 11:00 PM, Bryan Smith wrote:

> Hi everyone,
>
> I have another question I'm hoping someone would be kind enough to  
> answer. I am new to parsing XML (not to mention much of Python  
> itself) and I am trying to parse an XML file. The file I am trying
> to parse is this one:
> http://ws.audioscrobbler.com/2.0/user/bryansmith/topalbums.xml
>
> So far, I have written up a class for parsing this file in my  
> attempts to present to the user a list of top albums on their  
> last.fm profile. If you note, the artist name and album name are  
> both signified by the <name> tag which makes my job harder. If the  
> tag names were different, I wouldn't have a problem. Listed below is  
> the class I have written to parse the file. My question then is  
> this: is there a way I can say something like "if tag_name == album  
> name tag then....elif tag_name == artist name tag....". I hope this  
> is clear.
>
> As it stands right now, if I parse this file and print the results,  
> this is what I get (understandably) if I try to print out in the  
> following fashion - album (playcount): Vheissu (332), Thrice (289),  
> The Artist in the Ambulance (286), Thrice (210) and so on. Thrice is  
> the artist name. I want to be able to differentiate between the  
> "artist" name tag and the "album" name tag.
>
>
> Class as it stands right now:
>
> class GetTopAlbums(ContentHandler):
>
>     in_album_tag = False
>     in_playcount_tag = False
>
>     def __init__(self, album, playcount):
>         ContentHandler.__init__(self)
>         self.album = album
>         self.playcount = playcount
>         self.data = []
>
>     def startElement(self, tag_name, attr):
>         if tag_name == "name":
>             self.in_album_tag = True
>         elif tag_name == "playcount":
>             self.in_playcount_tag = True
>
>     def endElement(self, tag_name):
>         if tag_name == "name":
>             content = "".join(self.data)
>             self.data = []
>             self.album.append(content)
>             self.in_album_tag = False
>         elif tag_name == "playcount":
>             content = "".join(self.data)
>             self.data = []
>             self.playcount.append(content)
>             self.in_playcount_tag = False
>
>     def characters(self, string):
>         if self.in_album_tag == True:
>             self.data.append(string)
>         elif self.in_playcount_tag == True:
>             self.data.append(string)
>
> Thanks in advance!
> Bryan


