Processing XML File

Fri Jan 29 14:24:35 EST 2010

jakecjacobson, 29.01.2010 18:25:
> I need to take an XML web resource and split it up into smaller XML
> files.  I am able to retrieve the web resource but I can't find any
> good XML examples.  I am just learning Python so forgive me if this
> question has been answered many times in the past.
> 
> My resource is like:
> 
> <document>
>      ...
>      ...
> </document>
> <document>
>      ...
>      ...
> </document>

Is this what you get as a document or is this just /contained/ in the document?

Note that XML does not allow more than one root element, so the above is
not well-formed XML. Each of the two <document>...</document> parts forms
an XML document by itself, though.
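
To see this in practice, here is a minimal sketch using made-up sample data.
ElementTree should reject the two-root text as it stands, but accept it once
it is wrapped in an artificial root element. The name `wrapper` and the helper
`parses_as_xml` are assumptions for illustration, and the wrapping trick only
works if the text carries no XML declaration or DOCTYPE of its own.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for the resource described above.
data = "<document><a>1</a></document>\n<document><a>2</a></document>\n"

def parses_as_xml(text):
    """Return True if text is a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(parses_as_xml(data))  # False: there are two root elements

# Wrapping everything in one artificial root makes the text well-formed.
root = ET.fromstring("<wrapper>" + data + "</wrapper>")
documents = root.findall("document")
print(len(documents))  # 2
```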

> So in this example, I would need to output 2 files, each containing
> what is between the opening and closing document tags.

Are the two documents formatted as you show above? In that case, you can
simply iterate over the lines and cut the input whenever you see
"<document>". Or, if you are sure that "<document>" appears only as a
top-level element and never inside the documents, you can search for
"<document>" in the content (a string, I guess) and split it there.
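
The line-by-line variant could look roughly like the sketch below. It assumes
that every opening "<document>" tag sits at the start of its own line
(ignoring indentation); the function name `split_documents` and the sample
text are made up for illustration.

```python
def split_documents(text, marker="<document>"):
    """Split text into chunks, starting a new chunk at each line that
    begins with marker (after any leading whitespace)."""
    chunks = []
    current = []
    for line in text.splitlines(True):  # True keeps the line endings
        if line.lstrip().startswith(marker) and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    # Drop whitespace-only chunks, e.g. blank lines before the first marker.
    return [chunk for chunk in chunks if chunk.strip()]

sample = (
    "<document>\n  <title>first</title>\n</document>\n"
    "<document>\n  <title>second</title>\n</document>\n"
)
parts = split_documents(sample)
print(len(parts))  # 2
```

Each entry in `parts` is then a complete single-root document that can be
handed to an XML parser on its own.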

As was pointed out before, once you have these two documents, use the
xml.etree package to work with them.

Something like this might work:

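    import urllib2  # Python 2; on Python 3 use urllib.request instead
    import xml.etree.ElementTree as ET

    data = urllib2.urlopen(url).read()  # url: address of your web resource

    for part in data.split('<document>'):
        if not part.strip():
            continue  # skip the empty piece before the first <document>
        document = ET.fromstring('<document>' + part)
        print(document.tag)
        # ... do other stuff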
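
If the goal is one output file per document, the "other stuff" might be
something like the following sketch. It works on an in-memory sample instead
of the web resource, and the function name `write_documents` and the file
naming scheme (document_1.xml, document_2.xml, ...) are assumptions, not
anything from the original question.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def write_documents(data, out_dir, marker="<document>"):
    """Write each <document> element found in data to its own file
    and return the list of paths written."""
    paths = []
    # Skip the empty piece that split() yields before the first marker.
    parts = [part for part in data.split(marker) if part.strip()]
    for index, part in enumerate(parts, 1):
        element = ET.fromstring(marker + part)  # raises ParseError on bad input
        path = os.path.join(out_dir, "document_%d.xml" % index)
        ET.ElementTree(element).write(path, encoding="utf-8",
                                      xml_declaration=True)
        paths.append(path)
    return paths

# Made-up sample standing in for the downloaded resource.
sample = "<document><a>1</a></document>\n<document><a>2</a></document>\n"
out_dir = tempfile.mkdtemp()
paths = write_documents(sample, out_dir)
print(paths)
```

Parsing each part before writing it means a malformed piece fails loudly
instead of silently ending up in an output file.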

Stefan