splitting an XML file on the basis of XML tags
Chris
cwitts at gmail.com
Thu Apr 3 06:27:56 EDT 2008
On Apr 3, 8:51 am, Steve Holden <st... at holdenweb.com> wrote:
> bijeshn wrote:
> > On Apr 2, 5:37 pm, Chris <cwi... at gmail.com> wrote:
> >> bije... at gmail.com wrote:
> >>> Hi all,
> >>> i have an XML file with the following structure::
> >>> <r1>
> >>>   <r2>---------|
> >>>     <r3>       |
> >>>       <r4>     |
> >>>         .      |
> >>>         .      | --------------------> constitutes one record.
> >>>         .      |
> >>>         .      |
> >>>         .      |
> >>>       </r4>    |
> >>>     </r3>      |
> >>>   </r2>--------|
> >>>   <r2>
> >>>     .
> >>>     .
> >>>     . -----------------------|
> >>>     .                        |
> >>>     .                        |
> >>>     .                        |----------------------> there are n records in between....
> >>>     .                        |
> >>>     .                        |
> >>>     .                        |
> >>>     . -----------------------|
> >>>     .
> >>>     .
> >>>   </r2>
> >>>   <r2>---------|
> >>>     <r3>       |
> >>>       <r4>     |
> >>>         .      |
> >>>         .      | --------------------> constitutes one record.
> >>>         .      |
> >>>         .      |
> >>>         .      |
> >>>       </r4>    |
> >>>     </r3>      |
> >>>   </r2>--------|
> >>> </r1>
> >>> Here <r1> is the main root tag of the XML, and <r2>...</r2>
> >>> constitutes one record. What I would like to do is
> >>> to extract everything (xml tags and data) between nth <r2> tag and (n
> >>> +k)th <r2> tag. The extracted data is to be
> >>> written down to a separate file.
> >>> Thanks...
> >> You could create a generator expression out of it:
>
> >> txt = """<r1>
> >> <r2><r3><r4>1</r4></r3></r2>
> >> <r2><r3><r4>2</r4></r3></r2>
> >> <r2><r3><r4>3</r4></r3></r2>
> >> <r2><r3><r4>4</r4></r3></r2>
> >> <r2><r3><r4>5</r4></r3></r2>
> >> </r1>
> >> """
> >> l = len(txt.split('r2>'))-1
> >> a = ('<r2>%sr2>'%i for j,i in enumerate(txt.split('r2>')) if 0 < j < l
> >> and i.replace('>','').replace('<','').strip())
>
> >> Now you have a generator you can iterate through with a.next() or
> >> alternatively you could just create a list out of it by replacing the
> >> outer parens with square brackets.
>
> > Hmmm... will look into it.. Thanks
>
> > the XML file is almost a TB in size...
>
> Good grief. When will people stop abusing XML this way?
>
> > so SAX will have to be the parser.... i'm thinking of doing something
> > to split the file using SAX
> > ... Any suggestions on those lines..? If there are any other parsers
> > suitable, please suggest...
>
> You could try pulldom, but the documentation is disgraceful.
>
> ElementTree.iterparse *might* help.
>
> regards
> Steve
>
> --
> Steve Holden +1 571 484 6266 +1 800 494 3119
> Holden Web LLC http://www.holdenweb.com/
I abuse it because I can (and because I don't generally work with XML
files larger than 20-30meg) :)
And the OP never said the XML file was 1TB in size, which makes things
different.
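
For a file that size, something along the lines of the
ElementTree.iterparse route Steve mentions might be a starting point.
This is only a rough sketch, not something tested on a file anywhere
near a TB. The function name extract_records and its parameters are
made up for illustration, and it assumes every record is an <r2>
element sitting directly under the <r1> root, as in the structure
described above. Records are counted from 0.

```python
import io
import xml.etree.ElementTree as ET


def extract_records(source, dest, start, count, record_tag="r2", root_tag="r1"):
    """Copy records start .. start+count-1 (0-based) from source to dest.

    source: filename or binary file object holding the XML (assumed layout:
            record_tag elements directly under a single root element).
    dest:   binary file object that receives the extracted records.
    Returns the number of records written.
    """
    written = 0
    seen = 0        # complete records encountered so far
    depth = 0       # current nesting depth; the root element is depth 1
    root = None
    dest.write(("<%s>\n" % root_tag).encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                root = elem          # first element started is the root
            continue
        depth -= 1
        # A record is a record_tag element that is a direct child of the root.
        if depth == 1 and elem.tag == record_tag:
            if start <= seen < start + count:
                elem.tail = "\n"     # one record per line in the output
                dest.write(ET.tostring(elem))
                written += 1
            seen += 1
            root.clear()             # drop finished records to keep memory flat
            if seen >= start + count:
                break                # nothing more to extract, stop parsing
    dest.write(("</%s>\n" % root_tag).encode("utf-8"))
    return written


if __name__ == "__main__":
    sample = (b"<r1>\n"
              b"<r2><r3><r4>1</r4></r3></r2>\n"
              b"<r2><r3><r4>2</r4></r3></r2>\n"
              b"<r2><r3><r4>3</r4></r3></r2>\n"
              b"</r1>\n")
    out = io.BytesIO()
    extract_records(io.BytesIO(sample), out, start=1, count=2)
    print(out.getvalue().decode("utf-8"))
```

The point of clearing the root after each record is that iterparse
otherwise keeps building the whole tree in memory, which is exactly
what you can't afford here. The parser still has to read through
everything before the nth record, so skipping to a record far into
the file will take a while.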