XML -> Tab-delimited text file (using lxml)

Gibson str1d3r at gmail.com
Wed Nov 19 10:47:55 EST 2008


I'm attempting to do the following:
A) Read/scan/iterate/etc. through a semi-large XML file (about 135 MB)
B) Grab specific fields and output to a tab-delimited text file

The only problem I'm having is that the tab-delimited text file
requires the values in a different order than they appear in the XML
file. Example below.

<Title>
   <Item ID="1234abcd">
      <ItemVal ValueID="image" value="image.jpg" />
      <ItemVal ValueID="name" value="My Wonderful Product 1" />
      <ItemVal ValueID="description" value="My Wonderful Product 1 is
a wonderful product, indeed." />
   </Item>
   <Item ID="2345bcde">
      <ItemVal ValueID="image" value="image2.jpg" />
      <ItemVal ValueID="name" value="My Wonderful Product 2" />
      <ItemVal ValueID="description" value="My Wonderful Product 2 is
a wonderful product, indeed." />
   </Item>
   <Item ID="3456cdef">
      <ItemVal ValueID="image" value="image3.jpg" />
      <ItemVal ValueID="description" value="My Wonderful Product 3 is
a wonderful product, indeed." />
      <ItemVal ValueID="name" value="My Wonderful Product 3" />
   </Item>
</Title>

(Note: The last item "3456cdef" shows the description value as being
before the name, whereas in previous items it comes after. This is
to simulate the XML data with which I am working.)

And the tab-delimited text file should appear as follows (tabs are
shown as 2 spaces here, for the sake of readability):

(ID,name,description,image)
1234abcd  My Wonderful Product 1  My Wonderful Product 1 is a
wonderful product, indeed.  image.jpg
2345bcde  My Wonderful Product 2  My Wonderful Product 2 is a
wonderful product, indeed.  image2.jpg
3456cdef  My Wonderful Product 3  My Wonderful Product 3 is a
wonderful product, indeed.  image3.jpg

Currently, I'm working with the lxml library for iteration and
parsing, though this is proving to be a bit of a challenge for data
that needs to be reorganized (such as mine). Sample below.

''' Start code '''

from lxml import etree

def main():
  # Far too much room would be taken up if I were to paste my
  # real code here, so I will give a smaller example of what
  # I'm doing. Also, I do realize this is a very naive way to do
  # what it is I'm trying to accomplish... besides the fact
  # that it doesn't work as intended in the first place.

  out = open('output.txt','w')
  cat = etree.parse('catalog.xml')
  for el in cat.iter():
    # Search for the first item, make a new line for it
    # and output the ID
    if el.tag == "Item":
      out.write("\n%s\t" % (el.attrib['ID']))
    elif el.tag == "ItemVal":
      if el.attrib['ValueID'] == "name":
        out.write("%s\t" % (el.attrib['value']))
      elif el.attrib['ValueID'] == "description":
        out.write("%s\t" % (el.attrib['value']))
      elif el.attrib['ValueID'] == "image":
        out.write("%s\t" % (el.attrib['value']))
  out.close()

if __name__ == '__main__': main()

''' End code '''

I now realize that etree.iter() is meant to be used in an entirely
different fashion, but my brain is stuck on this naive way of coding.
If someone could give me a push in any correct direction I would be
most grateful.
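
For what it's worth, one direction that might work is to handle one
<Item> at a time: collect its <ItemVal> children into a dict keyed by
ValueID, then write the values out in whatever order the text file
needs. The following is only a minimal sketch under some assumptions:
the function name xml_to_tsv and the FIELDS tuple are made up for
illustration, missing values are written as empty strings, and the
fallback import of xml.etree.ElementTree is there only so the sketch
runs where lxml is not installed.

```python
import csv
import io

try:
    from lxml import etree  # the library used in the post
except ImportError:
    # Assumption: the standard-library parser is an acceptable stand-in here.
    import xml.etree.ElementTree as etree

# Column order wanted in the output, independent of the order in the XML.
FIELDS = ("name", "description", "image")


def xml_to_tsv(source, out):
    """Stream <Item> elements from source; write one tab-delimited row each."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != "Item":
            continue
        # Collect this item's values by ValueID, whatever order they arrive in.
        values = {v.get("ValueID"): v.get("value") for v in elem.iter("ItemVal")}
        writer.writerow([elem.get("ID")] + [values.get(f, "") for f in FIELDS])
        # Drop the processed children so memory use stays low on a large file.
        elem.clear()
```

Calling xml_to_tsv('catalog.xml', out), with out opened as
open('output.txt', 'w', newline=''), should produce one row per item.
Because iterparse only reports an <Item> once it is complete, the whole
135 MB file never has to be rearranged in memory at once.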


