splitting an XML file on the basis of XML tags

Chris cwitts at gmail.com
Thu Apr 3 06:27:56 EDT 2008


On Apr 3, 8:51 am, Steve Holden <st... at holdenweb.com> wrote:
> bijeshn wrote:
> > On Apr 2, 5:37 pm, Chris <cwi... at gmail.com> wrote:
> >> bije... at gmail.com wrote:
> >>> Hi all,
> >>>          i have an XML file with the following structure::
> >>> <r1>
> >>> <r2>-----|
> >>> <r3>     |
> >>> <r4>     |
> >>> .           |
> >>> .           |         --------------------> constitutes one record.
> >>> .           |
> >>> .           |
> >>> .           |
> >>> </r4>    |
> >>> </r3>    |
> >>> </r2>----|
> >>> <r2>
> >>> .
> >>> .
> >>> .    -----------------------|
> >>> .                           |
> >>> .                           |
> >>> .                           |----------------------> there are n
> >>> records in between....
> >>> .                           |
> >>> .                           |
> >>> .                           |
> >>> .   ------------------------|
> >>> .
> >>> .
> >>> </r2>
> >>> <r2>-----|
> >>> <r3>     |
> >>> <r4>     |
> >>> .           |
> >>> .           |         --------------------> constitutes one record.
> >>> .           |
> >>> .           |
> >>> .           |
> >>> </r4>    |
> >>> </r3>    |
> >>> </r2>----|
> >>> </r1>
> >>>        Here <r1> is the main root tag of the XML, and <r2>...</r2>
> >>> constitutes one record. What I would like to do is
> >>> to extract everything (xml tags and data) between nth <r2> tag and (n
> >>> +k)th <r2> tag. The extracted data is to be
> >>> written down to a separate file.
> >>> Thanks...
> >> You could create a generator expression out of it:
>
> >> txt = """<r1>
> >>     <r2><r3><r4>1</r4></r3></r2>
> >>     <r2><r3><r4>2</r4></r3></r2>
> >>     <r2><r3><r4>3</r4></r3></r2>
> >>     <r2><r3><r4>4</r4></r3></r2>
> >>     <r2><r3><r4>5</r4></r3></r2>
> >>     </r1>
> >>     """
> >> l = len(txt.split('r2>'))-1
> >> a = ('<r2>%sr2>'%i for j,i in enumerate(txt.split('r2>')) if 0 < j < l
> >> and i.replace('>','').replace('<','').strip())
>
> >> Now you have a generator you can iterate through with a.next() or
> >> alternatively you could just create a list out of it by replacing the
> >> outer parens with square brackets.
>
> > Hmmm... will look into it.. Thanks
>
> > the XML file is almost a TB in size...
>
> Good grief. When will people stop abusing XML this way?
>
> > so SAX will have to be the parser.... i'm thinking of doing something
> > to split the file using SAX
> > ... Any suggestions on those lines..? If there are any other parsers
> > suitable, please suggest...
>
> You could try pulldom, but the documentation is disgraceful.
>
> ElementTree.iterparse *might* help.
>
> regards
>   Steve
>
> --
> Steve Holden        +1 571 484 6266   +1 800 494 3119
> Holden Web LLC              http://www.holdenweb.com/

I abuse it because I can (and because I don't generally work with XML
files larger than 20-30meg) :)
And the OP never said the XML file was 1TB in size, which makes things
different.
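
For a file that size, something along the lines of the
ElementTree.iterparse suggestion quoted above is probably closer to what
is needed. What follows is only a minimal sketch, assuming Python 3 and
the standard-library xml.etree.ElementTree; write_records and its
parameters are names made up for illustration, and records are counted
from 0. It streams the input, writes records start .. start+count-1 to a
separate output, and clears the root after every record so memory use
stays roughly constant.

```python
import io
import xml.etree.ElementTree as ET


def write_records(source, dest, start, count, record_tag="r2", root_tag="r1"):
    """Copy `count` records, beginning at 0-based index `start`, to `dest`.

    source: a filename or a binary file object containing the XML.
    dest:   a binary file object that receives the extracted records.
    Returns the number of records actually written.
    (Hypothetical helper for illustration, not part of any library.)
    """
    seen = 0       # top-level records encountered so far
    written = 0    # records copied to dest
    depth = 0      # current nesting depth (root element = 1)
    root = None
    dest.write(("<%s>\n" % root_tag).encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first element started is the document root
            depth += 1
            continue
        depth -= 1
        # After the decrement, depth == 1 means a direct child of the root closed.
        if elem.tag == record_tag and depth == 1:
            if start <= seen < start + count:
                elem.tail = "\n"  # normalise trailing whitespace in the output
                dest.write(ET.tostring(elem))
                written += 1
            seen += 1
            root.clear()  # drop finished records so memory stays bounded
            if seen >= start + count:
                break  # nothing more to extract, stop reading the input
    dest.write(("</%s>\n" % root_tag).encode("utf-8"))
    return written


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<r1>"
        b"<r2><r3><r4>1</r4></r3></r2>"
        b"<r2><r3><r4>2</r4></r3></r2>"
        b"<r2><r3><r4>3</r4></r3></r2>"
        b"</r1>"
    )
    out = io.BytesIO()
    write_records(sample, out, start=1, count=2)
    print(out.getvalue().decode("utf-8"))
```

The parser still has to read through every record before the nth one, so
on a file approaching 1TB the run time will be dominated by I/O; the
sketch only keeps memory in check, it does not skip ahead.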


