[Tutor] Python XML for newbie

Mon Jul 2 18:49:59 CEST 2012

Peter Otten, 02.07.2012 09:57:
> Sean Carolan wrote:
>>> Thank you, this is helpful.  Minidom is confusing, even the
>>> documentation confirms this:
>>> "The name of the functions are perhaps misleading...."

Yes, I personally think that (Mini)DOM should be kept away from beginners
as far as possible.

>> Ok, so I read through these tutorials and am at least able to print
>> the XML output now.  I did this:
>>
>> doc = etree.parse('computer_books.xml')
>>
>> and then this:
>>
>> for elem in doc.iter():
>>     print elem.tag, elem.text
>>
>> Here's the data I'm interested in:
>>
>> index 1
>> field 11
>> value 9780596526740
>> datum
>>
>> How do you say, "If the field is 11, then print the next value"?  The
>> raw XML looks like this:
>>
>> <datum>
>> <index>1</index>
>> <field>11</field>
>> <value>9780470286975</value>
>> </datum>
>>
>> Basically I just want to pull all these ISBN numbers from the file.
> 
> With http://lxml.de/ you can use xpath:
> 
> $ cat computer_books.xml 
> <foo>
>     <bar>
>         <datum>
>             <index>1</index>
>             <field>11</field>
>             <value>9780470286975</value>
>         </datum>
>     </bar>
> </foo>
> $ cat read_isbn.py
> from lxml import etree
> 
> tree = etree.parse("computer_books.xml")
> print(tree.xpath("//datum[field=11]/value/text()"))
> $ python read_isbn.py 
> ['9780470286975']
> $ 
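
If lxml is not installed, the standard library's ElementTree understands
a small subset of XPath that is probably enough for this case. A minimal
sketch, assuming the sample document above is inlined as a string and
extended with a second <datum> (field 2) so that the filter has something
to skip; ElementTree compares the child text as a string, so the
predicate is written as '11':

```python
import xml.etree.ElementTree as ET

# Sample document from above, inlined so the sketch is self-contained.
# The second <datum> is added for illustration only.
XML = """\
<foo>
    <bar>
        <datum>
            <index>1</index>
            <field>11</field>
            <value>9780470286975</value>
        </datum>
        <datum>
            <index>2</index>
            <field>2</field>
            <value>Essential System Administration</value>
        </datum>
    </bar>
</foo>
"""

root = ET.fromstring(XML)

# [field='11'] keeps only <datum> elements that have a <field> child
# whose text is exactly "11"; /value then selects their <value> child.
isbns = [value.text for value in root.findall(".//datum[field='11']/value")]
print(isbns)
```

Note that findall() returns elements rather than text nodes, which is
why the sketch reads .text from each result instead of using text().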

And lxml.objectify is also a nice tool for this:

  $ cat example.xml
  <items>
   <item>
    <id>108</id>
    <data>
     <datum>
      <index>1</index>
      <field>2</field>
      <value>Essential System Administration</value>
     </datum>
    </data>
   </item>
  </items>

  $ python
  Python 2.7.3
  >>> from lxml import objectify
  >>> t = objectify.parse('example.xml')
  >>> for datum in t.iter('datum'):
  ...     if datum.field == 2:
  ...         print(datum.value)
  ...
  Essential System Administration
  >>>
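
The same loop can be sketched with plain ElementTree from the standard
library, in case lxml is not available. The helper name values_for_field
is hypothetical, and the second <datum> (field 11 with the ISBN from the
original question) is added to the example only as an assumption about
how the real file looks. Unlike objectify, ElementTree does not convert
values to numbers, so the field is compared as a string:

```python
import xml.etree.ElementTree as ET

# example.xml from above, inlined; the field-11 <datum> is an
# illustrative addition, not part of the original example file.
XML = """\
<items>
 <item>
  <id>108</id>
  <data>
   <datum>
    <index>1</index>
    <field>2</field>
    <value>Essential System Administration</value>
   </datum>
   <datum>
    <index>2</index>
    <field>11</field>
    <value>9780596526740</value>
   </datum>
  </data>
 </item>
</items>
"""


def values_for_field(root, field_id):
    """Return the <value> text of every <datum> whose <field> equals field_id."""
    wanted = str(field_id)  # element text is always a string here
    return [
        datum.findtext("value")
        for datum in root.iter("datum")
        if datum.findtext("field") == wanted
    ]


root = ET.fromstring(XML)
titles = values_for_field(root, 2)
isbns = values_for_field(root, 11)
print(titles)
print(isbns)
```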

This may even be faster than the XPath version, although that depends a
lot on the data.

Stefan