splitting an XML file on the basis of XML tags

Stefan Behnel stefan_ml at behnel.de
Mon Apr 7 01:59:33 EDT 2008


bijeshn at gmail.com schrieb:
> Hi all,
> 
>          I have an XML file with the following structure:
> 
> <r1>
> <r2>-----|
> <r3>     |
> <r4>     |
> .           |
> .           |         --------------------> constitutes one record.
> .           |
> .           |
> .           |
> </r4>    |
> </r3>    |
> </r2>----|
> <r2>
> .
> .
> .    -----------------------|
> .                           |
> .                           |
> .                           |----------------------> there are n records in between....
> .                           |
> .                           |
> .                           |
> .   ------------------------|
> .
> .
> </r2>
> <r2>-----|
> <r3>     |
> <r4>     |
> .           |
> .           |         --------------------> constitutes one record.
> .           |
> .           |
> .           |
> </r4>    |
> </r3>    |
> </r2>----|
> </r1>
> 
> 
>        Here <r1> is the root tag of the XML, and each <r2>...</r2>
> element constitutes one record. What I would like to do is to
> extract everything (XML tags and data) between the nth <r2> tag and
> the (n+k)th <r2> tag. The extracted data is to be
> written down to a separate file.

What do you mean by "written down to a separate file"? Do you have a specific
format in mind?

In general, you can try this:

    >>> from xml.etree import cElementTree as ET
    >>> itercontext = ET.iterparse("thefile.xml", events=("start", "end"))
    >>> event, root = itercontext.next()  # first event is the root's "start"
    >>> for event, element in itercontext:
    ...     if event == "end" and element.tag == "r2":
    ...         print ET.tostring(element) # write record subtree as XML
    ...         root.clear() # one record done, clean up everything

http://effbot.org/zone/element-iterparse.htm
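
For the "nth to (n+k)th record" part of the question, here is a rough sketch of
how the same iterparse loop could select a range of records and write them out.
It is written for Python 3, where cElementTree is gone and plain
xml.etree.ElementTree with the built-in next() does the same job. The function
names (extract_records, write_records), the 0-based start/count parameters and
the choice to wrap the output in a fresh <r1> root are all my assumptions, since
the output format was not specified. It also assumes that <r2> elements are
never nested inside each other.

```python
import io
import xml.etree.ElementTree as ET


def extract_records(source, start, count, tag="r2"):
    """Return records start .. start+count-1 (0-based) as XML strings.

    Hypothetical helper: 'source' is a file name or a binary file object.
    """
    picked = []
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event: start of the root element
    seen = 0
    for event, element in context:
        if event == "end" and element.tag == tag:
            if start <= seen < start + count:
                picked.append(ET.tostring(element, encoding="unicode"))
            seen += 1
            root.clear()  # drop finished records so memory use stays flat
            if seen >= start + count:
                break  # the rest of the file is not needed
    return picked


def write_records(records, out, root_tag="r1"):
    """Write the records to a text file object, wrapped in a new root.

    The wrapping root is an assumption; adjust to the format you need.
    """
    out.write("<%s>\n" % root_tag)
    for record in records:
        out.write(record.strip() + "\n")
    out.write("</%s>\n" % root_tag)


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<r1>"
        b"<r2><r3><r4>a</r4></r3></r2>"
        b"<r2><r3><r4>b</r4></r3></r2>"
        b"<r2><r3><r4>c</r4></r3></r2>"
        b"</r1>"
    )
    out = io.StringIO()
    write_records(extract_records(sample, 1, 2), out)
    print(out.getvalue())
```

For a real file you would pass the file name (or an open binary file) as
'source' and an open text file as 'out'. Collecting the records in a list is
fine for a small k; for a large range it would be better to write each record
as soon as it is seen instead of storing it.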

You can also do things like

    ...         print element.findtext("r3/r4")

Read the ElementTree tutorial to learn how to extract your data:

http://effbot.org/zone/element.htm#searching-for-subelements
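
As a small illustration of the search methods described there, here is a
Python 3 sketch run against a single made-up record. The record content
("some data") and the tag "missing" are invented for the example; only the
r2/r3/r4 nesting comes from the original question.

```python
import xml.etree.ElementTree as ET

# A single record with the nesting from the question; the text is made up.
record = ET.fromstring("<r2><r3><r4>some data</r4></r3></r2>")

# findtext() returns the text of the first element matching the path.
print(record.findtext("r3/r4"))  # some data

# A path that matches nothing yields None, or the default if one is given.
print(record.findtext("r3/missing"))  # None
print(record.findtext("r3/missing", default=""))

# find() returns the element itself, findall() a list of all matches.
r4 = record.find("r3/r4")
all_r4 = record.findall("r3/r4")
print(r4.tag, len(all_r4))  # r4 1
```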

Stefan


