splitting an XML file on the basis of XML tags

bijeshn bijeshn at gmail.com
Thu Apr 3 02:12:25 EDT 2008


On Apr 2, 5:37 pm, Chris <cwi... at gmail.com> wrote:
> bije... at gmail.com wrote:
> > Hi all,
>
> >          i have an XML file with the following structure::
>
> > <r1>
> > <r2>-----|
> > <r3>     |
> > <r4>     |
> > .           |
> > .           |         --------------------> constitutes one record.
> > .           |
> > .           |
> > .           |
> > </r4>    |
> > </r3>    |
> > </r2>----|
> > <r2>
> > .
> > .
> > .    -----------------------|
> > .                           |
> > .                           |
> > .                           |----------------------> there are n records in between....
> > .                           |
> > .                           |
> > .                           |
> > .   ------------------------|
> > .
> > .
> > </r2>
> > <r2>-----|
> > <r3>     |
> > <r4>     |
> > .           |
> > .           |         --------------------> constitutes one record.
> > .           |
> > .           |
> > .           |
> > </r4>    |
> > </r3>    |
> > </r2>----|
> > </r1>
>
> >        Here <r1> is the main root tag of the XML, and <r2>...</r2>
> > constitutes one record. What I would like to do is
> > to extract everything (XML tags and data) between the nth <r2> tag and
> > the (n+k)th <r2> tag. The extracted data is to be written to a
> > separate file.
>
> > Thanks...
>
> You could create a generator expression out of it:
>
> txt = """<r1>
>     <r2><r3><r4>1</r4></r3></r2>
>     <r2><r3><r4>2</r4></r3></r2>
>     <r2><r3><r4>3</r4></r3></r2>
>     <r2><r3><r4>4</r4></r3></r2>
>     <r2><r3><r4>5</r4></r3></r2>
>     </r1>
>     """
> l = len(txt.split('r2>'))-1
> a = ('<r2>%sr2>'%i for j,i in enumerate(txt.split('r2>')) if 0 < j < l
> and i.replace('>','').replace('<','').strip())
>
> Now you have a generator you can iterate through with a.next() or
> alternatively you could just create a list out of it by replacing the
> outer parens with square brackets.

Hmm, I'll look into it. Thanks.

The XML file is almost a TB in size, so it cannot be loaded into memory
as a whole and SAX will have to be the parser. I'm thinking of using SAX
to split the file. Any suggestions along those lines? If any other
parsers are suitable, please suggest them.
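
A minimal sketch of what such a SAX-based split might look like, using
only the standard library (xml.sax and XMLGenerator from
xml.sax.saxutils). The names RecordSplitter and split_records are
hypothetical, records are counted from 0, and the sketch assumes that
<r2> elements are never nested inside other <r2> elements. The selected
records are wrapped in a fresh <r1> root so the output is well-formed.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class RecordSplitter(xml.sax.ContentHandler):
    """Copy records first .. first + count - 1 (0-based) to `out`."""

    def __init__(self, out, first, count, record_tag="r2", root_tag="r1"):
        super().__init__()
        self.writer = XMLGenerator(out, encoding="utf-8")
        self.first = first
        self.last = first + count      # exclusive upper bound
        self.record_tag = record_tag
        self.root_tag = root_tag
        self.seen = 0                  # record start tags seen so far
        self.depth = 0                 # nesting depth inside current record
        self.copying = False           # True while inside a wanted record

    def startDocument(self):
        self.writer.startDocument()
        self.writer.startElement(self.root_tag, {})

    def endDocument(self):
        self.writer.endElement(self.root_tag)
        self.writer.endDocument()

    def startElement(self, name, attrs):
        if self.depth == 0:
            if name != self.record_tag:
                return                 # e.g. the original root element
            self.copying = self.first <= self.seen < self.last
            self.seen += 1
        self.depth += 1
        if self.copying:
            self.writer.startElement(name, attrs)

    def endElement(self, name):
        if self.depth == 0:
            return                     # closing tag outside any record
        if self.copying:
            self.writer.endElement(name)
        self.depth -= 1
        if self.depth == 0:
            self.copying = False

    def characters(self, content):
        if self.copying:
            self.writer.characters(content)


def split_records(source, out, first, count):
    """Parse `source` (file name or stream) and write the slice to `out`."""
    xml.sax.parse(source, RecordSplitter(out, first, count))
```

It could be called as split_records("big.xml", outfile, n, k), where
"big.xml" is a placeholder file name and outfile is a file opened for
writing. The parser only ever holds the current event, so memory use
should stay flat, but note that this sketch keeps reading to the end of
the input even after the last wanted record has been written.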


