A Unique XML Parsing Problem

Chris Rebert clp2 at rebertia.com
Sat Oct 23 20:02:55 EDT 2010


On Sat, Oct 23, 2010 at 4:40 PM, Devon <dshurick at gmail.com> wrote:
> I must quickly and efficiently parse some data contained in multiple
> XML files in order to perform some learning algorithms on the data.
> Info:
>
> I have thousands of files, each file corresponds to a single song.
> Each XML file contains information extracted from the song (called
> features). Examples include tempo, time signature, pitch classes, etc.
> An example from the beginning of one of these files looks like:
>
> <analysis decoder="Quicktime" version="0x7608000">
>    <track duration="29.12331" endOfFadeIn="0.00000"
> startOfFadeOut="29.12331" loudness="-12.097" tempo="71.031"
> tempoConfidence="0.386" timeSignature="4"
> timeSignatureConfidence="0.974" key="11" keyConfidence="1.000"
> mode="0" modeConfidence="1.000">
>        <sections>
>            <section start="0.00000" duration="7.35887"/>
>            <section start="7.35887" duration="13.03414"/>
>            <section start="20.39301" duration="8.73030"/>
>        </sections>
>        <segments>
>            <segment start="0.00000" duration="0.56000">
>                <loudness>
>                    <dB time="0">-60.000</dB>
>                    <dB time="0.45279" type="max">-59.897</dB>
>                </loudness>
>                <pitches>
>                    <pitch class="0">0.589</pitch>
>                    <pitch class="1">0.446</pitch>
>                    <pitch class="2">0.518</pitch>
>                    <pitch class="3">1.000</pitch>
>                    <pitch class="4">0.850</pitch>
>                    <pitch class="5">0.414</pitch>
>                    <pitch class="6">0.326</pitch>
>                    <pitch class="7">0.304</pitch>
>                    <pitch class="8">0.415</pitch>
>                    <pitch class="9">0.566</pitch>
>                    <pitch class="10">0.353</pitch>
>                    <pitch class="11">0.350</pitch>
>
> I am a statistician and therefore used to data being stored in CSV-
> like files, with each row being a datapoint, and each column being a
> feature. I would like to parse the data out of these XML files and
> write them out into a CSV file. Any help would be greatly appreciated.
> Mostly I am looking for a point in the right direction.

ElementTree is a good way to go for XML parsing. It ships in the
standard library, and lxml is a third-party library with a compatible
API:
http://docs.python.org/library/xml.etree.elementtree.html
http://effbot.org/zone/element-index.htm
http://codespeak.net/lxml/
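
To make that concrete, here's a minimal sketch of pulling the <track>
attributes out of one file with the standard-library ElementTree. The
SAMPLE string is a cut-down copy of your example, and track_features is
just a name I made up; the float() conversion assumes every <track>
attribute is numeric, which holds for the snippet you posted but may
not hold for every file.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the example from the original post.
SAMPLE = """<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" tempo="71.031" key="11">
    <sections>
      <section start="0.00000" duration="7.35887"/>
      <section start="7.35887" duration="13.03414"/>
    </sections>
  </track>
</analysis>"""


def track_features(xml_text):
    """Return the <track> attributes as a dict of floats (hypothetical helper)."""
    root = ET.fromstring(xml_text)      # root is the <analysis> element
    track = root.find("track")          # first <track> child of the root
    # Assumption: all <track> attributes are numeric strings.
    return {name: float(value) for name, value in track.attrib.items()}


features = track_features(SAMPLE)
```

For files on disk you'd use ET.parse(path).getroot() in place of
ET.fromstring(xml_text).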

And for CSV writing there's the standard library's csv module:
http://docs.python.org/library/csv.html
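
A rough sketch of the writing side, assuming you've already collected
one dict of features per song. csv.DictWriter takes the column names up
front, so this gathers them from the rows themselves. write_rows is a
made-up name, and the second row is invented data just to show what
happens when a feature is missing from some songs.

```python
import csv
import io


def write_rows(rows, out):
    """Write one CSV line per song; columns are the union of all feature names."""
    fieldnames = sorted({name for row in rows for name in row})
    # restval fills in features that a given song does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"duration": 29.12331, "tempo": 71.031},        # values from the post
    {"duration": 31.5, "tempo": 120.0, "key": 11},  # invented second song
]

buf = io.StringIO()   # stands in for a real file in this sketch
write_rows(rows, buf)
```

With a real file, open it with newline="" as the csv docs recommend and
pass the file object instead of buf.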

> And I am also more
> concerned about how to use the tags in the XML files to build feature
> names so I do not have to hard code them. For example, the first
> feature given by the above code would be "track duration" with a value
> of 29.12331

You'll probably want to look at namedtuple
(http://docs.python.org/library/collections.html#collections.namedtuple)
or the "bunch" recipe (google for "Python bunch").
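
One way to avoid hard-coding the feature names, sketched under the
assumption that combining the tag and the attribute name is good enough
for you. namedtuple fields have to be valid Python identifiers, so this
produces "track_duration" rather than "track duration".
attribute_features and the Track type are names I invented for the
example.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Cut-down version of the example from the original post.
SAMPLE = """<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" tempo="71.031" timeSignature="4"/>
</analysis>"""


def attribute_features(element):
    """Build 'tag_attribute' feature names from one element (hypothetical helper)."""
    return {
        "%s_%s" % (element.tag, name): value
        for name, value in element.attrib.items()
    }


track = ET.fromstring(SAMPLE).find("track")
features = attribute_features(track)   # values are still strings here

# The field names come from the XML itself, not from the code.
Track = namedtuple("Track", list(features))
record = Track(**features)
```

record.track_duration then gives "29.12331" as a string, and
Track._fields can double as the CSV header row.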

Cheers,
Chris
--
http://blog.rebertia.com


