splitting an XML file on the basis of XML tags
Steve Holden
steve at holdenweb.com
Thu Apr 3 02:51:42 EDT 2008
bijeshn wrote:
> On Apr 2, 5:37 pm, Chris <cwi... at gmail.com> wrote:
>> bije... at gmail.com wrote:
>>> Hi all,
>>> i have an XML file with the following structure::
>>> <r1>
>>>   <r2>          --|
>>>     <r3>          |
>>>       <r4>        |
>>>         ...       |--> constitutes one record
>>>       </r4>       |
>>>     </r3>         |
>>>   </r2>         --|
>>>   <r2>
>>>     ...                (there are n records in between)
>>>   </r2>
>>>   <r2>          --|
>>>     <r3>          |
>>>       <r4>        |
>>>         ...       |--> constitutes one record
>>>       </r4>       |
>>>     </r3>         |
>>>   </r2>         --|
>>> </r1>
>>> Here <r1> is the root tag of the XML, and each <r2>...</r2> element
>>> constitutes one record. What I would like to do is extract everything
>>> (XML tags and data) between the nth <r2> tag and the (n+k)th <r2> tag.
>>> The extracted data is to be written to a separate file.
>>> Thanks...
>> You could create a generator expression out of it:
>>
>> txt = """<r1>
>> <r2><r3><r4>1</r4></r3></r2>
>> <r2><r3><r4>2</r4></r3></r2>
>> <r2><r3><r4>3</r4></r3></r2>
>> <r2><r3><r4>4</r4></r3></r2>
>> <r2><r3><r4>5</r4></r3></r2>
>> </r1>
>> """
>> l = len(txt.split('r2>'))-1
>> a = ('<r2>%sr2>'%i for j,i in enumerate(txt.split('r2>')) if 0 < j < l
>> and i.replace('>','').replace('<','').strip())
>>
>> Now you have a generator you can iterate through with a.next() (next(a) on Python 3) or
>> alternatively you could just create a list out of it by replacing the
>> outer parens with square brackets.
>
> Hmmm... will look into it.. Thanks
>
> the XML file is almost a TB in size...
>
Good grief. When will people stop abusing XML this way?
> so SAX will have to be the parser.... i'm thinking of doing something
> to split the file using SAX
> ... Any suggestions on those lines..? If there are any other parsers
> suitable, please suggest...
You could try pulldom, but the documentation is disgraceful.
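
For what it's worth, a minimal pulldom sketch might look like the following. It is written for Python 3 rather than the Python current at the time of this thread. The names iter_records and record_tag are made up for illustration, and the sketch assumes <r2> elements never nest inside one another.

```python
import io
import itertools
from xml.dom import pulldom


def iter_records(source, record_tag="r2"):
    """Yield each <record_tag> element of source as an XML string.

    Hypothetical helper: assumes records do not nest.
    """
    events = pulldom.parse(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == record_tag:
            # Pull the rest of this element (children and text) into the node.
            events.expandNode(node)
            yield node.toxml()


# Usage on the small sample from earlier in the thread:
# take records 1 and 2 (0-based), i.e. skip the first one.
sample = io.BytesIO(
    b"<r1>\n"
    b"<r2><r3><r4>1</r4></r3></r2>\n"
    b"<r2><r3><r4>2</r4></r3></r2>\n"
    b"<r2><r3><r4>3</r4></r3></r2>\n"
    b"</r1>\n"
)
for record in itertools.islice(iter_records(sample), 1, 3):
    print(record)
```

Only one record is expanded into a DOM fragment at a time, which is the point of using pulldom here. Whether it is fast enough for a file of that size is something you would have to measure.
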
ElementTree.iterparse *might* help.
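
A rough sketch of the iterparse approach, under some assumptions: Python 3 with the standard-library xml.etree.ElementTree, records that are direct children of the root, and <r2> elements that never nest. The function name extract_records and its parameters (start, count, record_tag) are invented for this example; start is 0-based.

```python
import io
import xml.etree.ElementTree as ET


def extract_records(source, dest, start, count, record_tag="r2"):
    """Write records start .. start+count-1 (0-based) from source to dest.

    Hypothetical helper; returns the number of records written.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)          # first event: start of the root element
    dest.write("<%s>\n" % root.tag)
    seen = written = 0
    for event, elem in context:
        if event != "end" or elem.tag != record_tag:
            continue
        if seen >= start:
            elem.tail = None         # drop whitespace that followed the record
            dest.write(ET.tostring(elem, encoding="unicode") + "\n")
            written += 1
        seen += 1
        root.clear()                 # discard records already handled
        if written == count:
            break                    # no need to read the rest of the file
    dest.write("</%s>\n" % root.tag)
    return written


# Usage on a small in-memory sample; a real run would pass open files.
sample = io.BytesIO(
    b"<r1>\n"
    b"<r2><r3><r4>1</r4></r3></r2>\n"
    b"<r2><r3><r4>2</r4></r3></r2>\n"
    b"<r2><r3><r4>3</r4></r3></r2>\n"
    b"</r1>\n"
)
out = io.StringIO()
extract_records(sample, out, start=1, count=2)
print(out.getvalue())
```

The root.clear() call is what keeps memory use bounded: without it every record parsed so far stays attached to the root element. The parser still has to read through all the records before the nth one, so the time to reach records near the end of the file will be considerable.
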
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/