A Unique XML Parsing Problem
Stefan Behnel
stefan_ml at behnel.de
Sun Oct 24 03:00:23 EDT 2010
Devon, 24.10.2010 01:40:
> I must quickly and efficiently parse some data contained in multiple
> XML files in order to perform some learning algorithms on the data.
>
> I have thousands of files; each file corresponds to a single song.
> Each XML file contains information extracted from the song (called
> features). Examples include tempo, time signature, pitch classes, etc.
> [...]
> I am a statistician and therefore used to data being stored in CSV-
> like files, with each row being a datapoint, and each column being a
> feature. I would like to parse the data out of these XML files and
> write them out into a CSV file. Any help would be greatly appreciated.
> Mostly I am looking for a point in the right direction. I have heard
> about Beautiful Soup but never used it. I am currently reading Dive
> Into Python's chapters on HTML and XML parsing.
That chapter is mostly out of date, and BeautifulSoup is certainly not the
right tool for dealing with XML, both for performance and compliance
reasons. If you need performance, as you stated above, look at cElementTree
in the stdlib.
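
For illustration, here is a minimal sketch of parsing with ElementTree. The sample document is made up (the actual XML was elided from the quote above) and is only shaped to fit the "track duration" example. On current Python versions, xml.etree.ElementTree uses the C accelerator automatically, so the separate cElementTree import is no longer needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real files' layout is not shown in this thread.
SAMPLE = '<analysis><track duration="29.12331" tempo="120.5"/></analysis>'

# Parse from a string here; for a file, use ET.parse(path).getroot().
root = ET.fromstring(SAMPLE)

# Walk every element in document order and show its tag and attributes.
for elem in root.iter():
    print(elem.tag, dict(elem.attrib))
```

Attribute values come back as strings, so convert them with float() or int() where the learning code needs numbers.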
> And I am also more
> concerned about how to use the tags in the XML files to build feature
> names so I do not have to hard code them. For example, the first
> feature given by the above code would be "track duration" with a value
> of 29.12331
If the rules are as simple as that (i.e. tag name + attribute name), it'll
be easy going with ElementTree. Don't put too much effort into separating
the data from the XML format, though. XML parsing is fast and has the clear
advantage over CSV files that the data is safely stored in a well-defined,
expressive format, including character encoding and named data fields.
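
If a CSV file is still wanted, the "tag name + attribute name" rule could be sketched roughly as follows. The helper extract_features, the file names and the sample documents are all hypothetical, and the sketch assumes each tag occurs at most once per file (a repeated tag would overwrite the earlier value).

```python
import csv
import io
import xml.etree.ElementTree as ET

def extract_features(xml_text):
    """Return {"<tag> <attribute>": value} for every attribute in the document."""
    features = {}
    for elem in ET.fromstring(xml_text).iter():
        for name, value in elem.attrib.items():
            features[f"{elem.tag} {name}"] = value
    return features

# Hypothetical inputs, one string per song; real code would read the files.
songs = {
    "song1.xml": '<analysis><track duration="29.12331" tempo="120.5"/></analysis>',
    "song2.xml": '<analysis><track duration="183.2" key="7"/></analysis>',
}

rows = {name: extract_features(text) for name, text in songs.items()}

# The column set is the union of all feature names seen in any file.
columns = sorted({key for row in rows.values() for key in row})

# Write to an in-memory buffer here; open(path, "w", newline="") works the same.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["file"] + columns, restval="")
writer.writeheader()
for name, row in rows.items():
    writer.writerow({"file": name, **row})

csv_text = out.getvalue()
```

Features missing from a given song are written as empty cells, so no column names have to be hard-coded.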
Stefan