A Unique XML Parsing Problem

Stefan Behnel stefan_ml at behnel.de
Sun Oct 24 03:00:23 EDT 2010


Devon, 24.10.2010 01:40:
> I must quickly and efficiently parse some data contained in multiple
> XML files in order to perform some learning algorithms on the data.
>
> I have thousands of files, each file corresponds to a single song.
> Each XML file contains information extracted from the song (called
> features). Examples include tempo, time signature, pitch classes, etc.
> [...]
> I am a statistician and therefore used to data being stored in CSV-
> like files, with each row being a datapoint, and each column being a
> feature. I would like to parse the data out of these XML files and
> write them out into a CSV file.  Any help would be greatly appreciated.
> Mostly I am looking for a point in the right direction. I have heard
> about Beautiful Soup but never used it. I am currently reading Dive
> Into Python's chapters on HTML and XML parsing.

That chapter is mostly out of date, and BeautifulSoup is certainly not the 
right tool for dealing with XML, both for performance and compliance 
reasons. If you need performance, as you stated above, look at cElementTree 
(xml.etree.cElementTree) in the stdlib.


> And I am also more
> concerned about how to use the tags in the XML files to build feature
> names so I do not have to hard code them. For example, the first
> feature given by the above code would be "track duration" with a value
> of 29.12331

If the rules are as simple as that (i.e. tag name + attribute name), it'll 
be easy going with ElementTree. Don't put too much effort into separating 
the data from the XML format, though. XML parsing is fast and has the clear 
advantage over CSV files that the data is safely stored in a well-defined, 
expressive format, including character encoding and named data fields.
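
To illustrate the "tag name + attribute name" rule, here is a minimal sketch. 
It assumes each feature is stored as an attribute on some element, which is 
only a guess based on the "track duration" example, since the actual file 
layout is not shown in this thread. The sample XML and the helper names 
extract_features and write_csv are made up for illustration. It uses 
xml.etree.ElementTree, which exposes the same API as cElementTree.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real song files may be laid out differently.
SAMPLE = (
    '<analysis>'
    '<track duration="29.12331" tempo="120.5"/>'
    '<meta key="7"/>'
    '</analysis>'
)


def extract_features(xml_source):
    """Return {"<tag> <attribute>": value} for every attribute in one file.

    xml_source may be a file name or an open file object.
    Caveat: if the same tag/attribute pair occurs more than once,
    later occurrences overwrite earlier ones.
    """
    features = {}
    root = ET.parse(xml_source).getroot()
    for element in root.iter():
        for name, value in element.attrib.items():
            features["%s %s" % (element.tag, name)] = value
    return features


def write_csv(rows, out):
    """Write one CSV row per song; columns are the union of all feature names."""
    fieldnames = sorted(set().union(*rows))
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # In practice, rows would be built from the song files, e.g.
    # rows = [extract_features(path) for path in paths]
    rows = [extract_features(io.StringIO(SAMPLE))]
    out = io.StringIO()
    write_csv(rows, out)
    print(out.getvalue())
```

Because the column names are collected from the files themselves, nothing is 
hard coded. A song that lacks a feature simply gets an empty cell in that 
column.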

Stefan



