Processing XML File

Sells, Fred fred.sells at adventistcare.org
Fri Jan 29 14:31:56 EST 2010


Google is your friend.  ElementTree is one of the better-documented
modules, IMHO, but there are many modules that can do this.
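
For what it's worth, here is a minimal sketch of the ElementTree route,
assuming the resource really is a bare sequence of <document> elements
with no XML declaration in front. It wraps the text in a made-up <root>
element so the whole thing parses as one document, then serializes each
child separately. The wrapper name, the function name split_documents,
and the sample data are my own assumptions, and it is written for
Python 3.

```python
import xml.etree.ElementTree as ET

def split_documents(text):
    """Return each top-level <document> element as its own XML string."""
    # Hypothetical <root> wrapper: makes the fragment sequence well-formed.
    # This would fail if text started with an XML declaration or DOCTYPE.
    wrapper = ET.fromstring('<root>' + text + '</root>')
    parts = []
    for child in wrapper.findall('document'):
        child.tail = None  # drop whitespace that followed the closing tag
        parts.append(ET.tostring(child, encoding='unicode'))
    return parts

# Made-up sample standing in for the downloaded resource.
sample = '<document><a>1</a></document>\n<document><a>2</a></document>'
parts = split_documents(sample)
```

Each entry in parts can then be parsed again with ET.fromstring() or
written straight to a file.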

> -----Original Message-----
> From: python-list-bounces+frsells=adventistcare.org at python.org
> [mailto:python-list-bounces+frsells=adventistcare.org at python.org] On
> Behalf Of Stefan Behnel
> Sent: Friday, January 29, 2010 2:25 PM
> To: python-list at python.org
> Subject: Re: Processing XML File
> 
> jakecjacobson, 29.01.2010 18:25:
> > I need to take an XML web resource and split it up into smaller XML
> > files.  I am able to retrieve the web resource but I can't find any
> > good XML examples.  I am just learning Python so forgive me if this
> > question has been answered many times in the past.
> >
> > My resource is like:
> >
> > <document>
> >      ...
> >      ...
> > </document>
> > <document>
> >      ...
> >      ...
> > </document>
> 
> Is this what you get as a document or is this just /contained/ in the
> document?
> 
> Note that XML does not allow more than one root element, so the above is
> not XML. Each of the two <document>...</document> parts forms an XML
> document by itself, though.
> 
> 
> > So in this example, I would need to output 2 files, each containing
> > what is between the opening and closing document tags.
> 
> Are the two files formatted as you show above? In that case, you can simply
> iterate over the lines and cut the document when you see "<document>". Or,
> if you are sure that "<document>" only appears as top-most elements and not
> inside of the documents, you can search for "<document>" in the content (a
> string, I guess) and split it there.
> 
> As was pointed out before, once you have these two documents, use the
> xml.etree package to work with them.
> 
> Something like this might work:
> 
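>     import urllib2
>     import xml.etree.ElementTree as ET
> 
>     data = urllib2.urlopen(url).read()  # url: address of the web resource
> 
>     for part in data.split('<document>'):
>         if not part.strip():
>             continue  # skip the empty chunk before the first <document>
>         document = ET.fromstring('<document>' + part)
>         print(document.tag)
>         # ... do other stuff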
> 
> Stefan
> --
> http://mail.python.org/mailman/listinfo/python-list
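
Since the goal was one output file per document, here is a rough sketch
of the string-splitting idea that also writes the pieces out. It assumes
the <document> tags carry no attributes and are never nested. The
function name write_documents, the document_N.xml naming scheme, and the
sample text are hypothetical, and the code targets Python 3.

```python
import os
import re
import tempfile

def write_documents(text, out_dir):
    """Write each <document>...</document> block in text to its own file."""
    # Non-greedy match across lines; assumes <document> is never nested.
    blocks = re.findall(r'<document>.*?</document>', text, flags=re.DOTALL)
    paths = []
    for index, block in enumerate(blocks, start=1):
        # Hypothetical naming scheme: document_1.xml, document_2.xml, ...
        path = os.path.join(out_dir, 'document_%d.xml' % index)
        with open(path, 'w', encoding='utf-8') as handle:
            handle.write(block)
        paths.append(path)
    return paths

# Made-up sample standing in for the downloaded resource.
sample = '<document>\n  <a>1</a>\n</document>\n<document>\n  <a>2</a>\n</document>\n'
out_dir = tempfile.mkdtemp()
paths = write_documents(sample, out_dir)
```

This never parses the XML, so it will not notice a malformed document.
Running each block through ET.fromstring() first would catch that.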



