splitting an XML file on the basis of XML tags

Wed Apr 2 08:37:27 EDT 2008

bije... at gmail.com wrote:
> Hi all,
>
> I have an XML file with the following structure:
>
> <r1>
> <r2>-----|
> <r3>     |
> <r4>     |
> .        |
> .        |--------------------> constitutes one record.
> .        |
> .        |
> .        |
> </r4>    |
> </r3>    |
> </r2>----|
> <r2>
> .
> .
> .   -----|
> .        |
> .        |
> .        |--------------------> there are n records in between....
> .        |
> .        |
> .        |
> .   -----|
> .
> .
> </r2>
> <r2>-----|
> <r3>     |
> <r4>     |
> .        |
> .        |--------------------> constitutes one record.
> .        |
> .        |
> .        |
> </r4>    |
> </r3>    |
> </r2>----|
> </r1>
>
>
> Here <r1> is the root tag of the XML, and each <r2>...</r2> element
> constitutes one record. What I would like to do is extract everything
> (XML tags and data) between the nth <r2> tag and the (n+k)th <r2> tag.
> The extracted data is to be written to a separate file.
>
> Thanks...

You could create a generator expression out of it:

txt = """<r1>
    <r2><r3><r4>1</r4></r3></r2>
    <r2><r3><r4>2</r4></r3></r2>
    <r2><r3><r4>3</r4></r3></r2>
    <r2><r3><r4>4</r4></r3></r2>
    <r2><r3><r4>5</r4></r3></r2>
    </r1>
    """
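# Splitting on 'r2>' leaves each record body in its own piece,
# separated by pieces that hold only whitespace and a stray '<'.
parts = txt.split('r2>')
l = len(parts) - 1
# Skip the first and last pieces (the <r1> wrapper) and the empty separators.
a = ('<r2>%sr2>' % i for j, i in enumerate(parts)
     if 0 < j < l and i.replace('>', '').replace('<', '').strip())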

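Now you have a generator that yields one <r2>...</r2> record at a time.
You can step through it with a.next() (spelled next(a) from Python 2.6
on, and the only spelling in Python 3), or you could build a list
instead by replacing the outer parentheses with square brackets.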
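
To get from the generator to what was actually asked (records n through
n+k written to a separate file), something like the following might
work. It is only a sketch: the function names iter_records,
select_records and write_records, the 1-based inclusive numbering and
the output path are assumptions made for illustration, not part of the
original question or reply. It reuses the same string-splitting trick,
so it shares that trick's limits: it assumes <r2> tags carry no
attributes, are never nested, and that no other tag name ends in "r2".

```python
from itertools import islice

txt = """<r1>
<r2><r3><r4>1</r4></r3></r2>
<r2><r3><r4>2</r4></r3></r2>
<r2><r3><r4>3</r4></r3></r2>
<r2><r3><r4>4</r4></r3></r2>
<r2><r3><r4>5</r4></r3></r2>
</r1>
"""

def iter_records(text, tag="r2"):
    """Yield each <tag>...</tag> record as a string (hypothetical helper)."""
    parts = text.split(tag + ">")
    last = len(parts) - 1
    for j, piece in enumerate(parts):
        # Drop the wrapper pieces at both ends and the empty separators.
        if 0 < j < last and piece.replace(">", "").replace("<", "").strip():
            yield "<%s>%s%s>" % (tag, piece, tag)

def select_records(text, n, k, tag="r2"):
    """Return records n through n+k, counting from 1, both ends included."""
    # islice takes zero-based start and exclusive stop positions.
    return list(islice(iter_records(text, tag), n - 1, n + k))

def write_records(text, n, k, path, tag="r2"):
    """Write the selected records to path, one per line; return the count."""
    chunk = select_records(text, n, k, tag)
    with open(path, "w") as out:
        out.write("\n".join(chunk) + "\n")
    return len(chunk)
```

Calling write_records(txt, 2, 2, "subset.xml") would then write records
2, 3 and 4 to subset.xml. Note that the output has no root element, so
it is a fragment rather than a complete XML document. For input that is
less regular than the sample, a real parser such as
xml.etree.ElementTree is probably the safer route.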