stripping fields from xml file into a csv

Sun Feb 28 11:41:21 EST 2010

On Feb 28, 12:05 am, Stefan Behnel <stefan... at behnel.de> wrote:
> Hal Styli, 27.02.2010 21:50:
>
> > I have a sed solution to the problems below but would like to rewrite
> > in python...
>
> Note that sed (or any other line-based or text-based tool) is not a
> sensible way to handle XML. If you want to read XML, use an XML parser.
> Parsers are designed to do exactly what you want in a standards-compliant
> way, and they can deal with all sorts of XML formatting and encodings.
>
> > I need to strip out some data from a quirky xml file into a csv:
>
> > from something like this
>
> > < ..... cust="dick" .... product="eggs" ... quantity="12" .... >
> > < .... cust="tom" .... product="milk" ... quantity="2" ...>
> > < .... cust="harry" .... product="bread" ... quantity="1" ...>
> > < .... cust="tom" .... product="eggs" ... quantity="6" ...>
> > < ..... cust="dick" .... product="eggs" ... quantity="6" .... >
>
> As others have noted, this doesn't tell much about your XML. A more
> complete example would be helpful.
>
> > to this
>
> > dick,eggs,12
> > tom,milk,2
> > harry,bread,1
> > tom,eggs,6
> > dick,eggs,6
>
> > I am new to python and xml and it would be great to see some slick
> > ways of achieving the above by using python's XML capabilities to
> > parse the original file or python's regex to achieve what I did using
> > sed.
>
> It's funny how often people still think that SAX is a good way to solve XML
> problems. Here's an untested solution that uses xml.etree.ElementTree:
>
>     from xml.etree import ElementTree as ET
>
>     csv_field_order = ['cust', 'product', 'quantity']
>
>     clean_up_used_elements = None
>     for event, element in ET.iterparse("thefile.xml", events=['start']):
>         # you may want to select a specific element.tag here
>
>         # format and print the CSV line to the standard output
>         print(','.join(element.attrib.get(title, '')
>                        for title in csv_field_order))
>
>         # save some memory (in case the XML file is very large)
>         if clean_up_used_elements is None:
>             # this assigns the clear() method of the root (first) element
>             clean_up_used_elements = element.clear
>         clean_up_used_elements()
>
> You can strip everything dealing with 'clean_up_used_elements' (basically
> the last section) if your XML file is small enough to fit into memory (a
> couple of MB is usually fine).
>
> Stefan

This solution is so beautiful and elegant. Thank you. Now I am off to
learn ElementTree.
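
For my own notes, here is roughly how I understand the small-file
variant Stefan mentions, with the cleanup step dropped and the csv
module doing the joining. This is only a sketch, assuming Python 3; the
<orders>/<order> element names and the sample data are made up, since my
real file looks different.

```python
import csv
import io
from xml.etree import ElementTree as ET

CSV_FIELD_ORDER = ['cust', 'product', 'quantity']


def xml_attributes_to_csv(xml_source, out, tag='order'):
    """Write one CSV row per <tag> element found in xml_source.

    xml_source may be a filename or a binary file object.
    The default tag name 'order' is a hypothetical placeholder.
    """
    writer = csv.writer(out, lineterminator='\n')
    # A 'start' event fires once an opening tag and its attributes are read.
    for event, element in ET.iterparse(xml_source, events=['start']):
        if element.tag != tag:
            continue  # skip the root and any other elements
        # Missing attributes become empty CSV fields.
        writer.writerow([element.attrib.get(name, '')
                         for name in CSV_FIELD_ORDER])


# Made-up sample data standing in for the real file.
SAMPLE = b"""<orders>
  <order id="1" cust="dick" product="eggs" quantity="12"/>
  <order id="2" cust="tom" product="milk" quantity="2"/>
  <order id="3" cust="harry" product="bread" quantity="1"/>
</orders>"""

out = io.StringIO()
xml_attributes_to_csv(io.BytesIO(SAMPLE), out)
print(out.getvalue(), end='')
```

Using csv.writer instead of ','.join() means a value that itself
contains a comma or a quote would still come out as valid CSV.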

By the way, Stefan, I am using Python 2.6. Do you know the differences
between ElementTree and cElementTree?
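
From what I have read so far, cElementTree is a C implementation of the
same API, so the usual idiom seems to be to try it first and fall back
to the pure-Python module. A minimal sketch of that import, assuming the
two modules really are interchangeable for this use (the sample element
is made up):

```python
try:
    # C implementation; ships with Python 2.5 through 3.8.
    import xml.etree.cElementTree as ET
except ImportError:
    # Plain module; from Python 3.3 on it picks up the C accelerator
    # by itself, and cElementTree no longer exists in 3.9 and later.
    import xml.etree.ElementTree as ET

# Made-up element, just to show the API is used the same way.
root = ET.fromstring('<order cust="tom" product="milk" quantity="2"/>')
print(root.get('cust'), root.get('product'), root.get('quantity'))
```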