splitting an XML file on the basis of XML tags

Steve Holden steve at holdenweb.com
Thu Apr 3 02:51:42 EDT 2008


bijeshn wrote:
> On Apr 2, 5:37 pm, Chris <cwi... at gmail.com> wrote:
>> bije... at gmail.com wrote:
>>> Hi all,
>>>          i have an XML file with the following structure::
>>> <r1>
>>> <r2>-----|
>>> <r3>     |
>>> <r4>     |
>>> .           |
>>> .           |         --------------------> constitutes one record.
>>> .           |
>>> .           |
>>> .           |
>>> </r4>    |
>>> </r3>    |
>>> </r2>----|
>>> <r2>
>>> .
>>> .
>>> .    -----------------------|
>>> .                           |
>>> .                           |
>>> .                           |----------------------> there are n records in between....
>>> .                           |
>>> .                           |
>>> .                           |
>>> .   ------------------------|
>>> .
>>> .
>>> </r2>
>>> <r2>-----|
>>> <r3>     |
>>> <r4>     |
>>> .           |
>>> .           |         --------------------> constitutes one record.
>>> .           |
>>> .           |
>>> .           |
>>> </r4>    |
>>> </r3>    |
>>> </r2>----|
>>> </r1>
>>>        Here <r1> is the main root tag of the XML, and <r2>...</r2>
>>> constitutes one record. What I would like to do is
>>> to extract everything (xml tags and data) between nth <r2> tag and (n
>>> +k)th <r2> tag. The extracted data is to be
>>> written down to a separate file.
>>> Thanks...
>> You could create a generator expression out of it:
>>
>> txt = """<r1>
>>     <r2><r3><r4>1</r4></r3></r2>
>>     <r2><r3><r4>2</r4></r3></r2>
>>     <r2><r3><r4>3</r4></r3></r2>
>>     <r2><r3><r4>4</r4></r3></r2>
>>     <r2><r3><r4>5</r4></r3></r2>
>>     </r1>
>>     """
>> l = len(txt.split('r2>'))-1
>> a = ('<r2>%sr2>'%i for j,i in enumerate(txt.split('r2>')) if 0 < j < l
>> and i.replace('>','').replace('<','').strip())
>>
>> Now you have a generator you can iterate through with a.next() or
>> alternatively you could just create a list out of it by replacing the
>> outer parens with square brackets.
> 
> Hmmm... will look into it.. Thanks
> 
> the XML file is almost a TB in size...
> 
Good grief. When will people stop abusing XML this way?

> so SAX will have to be the parser.... i'm thinking of doing something
> to split the file using SAX
> ... Any suggestions on those lines..? If there are any other parsers
> suitable, please suggest...

You could try pulldom, but the documentation is disgraceful.
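
Since the pulldom docs are of little help, here is a minimal sketch of how it might be used for this job. It is untested on anything near a terabyte. The function name, the 0-based `start`/`count` parameters and the in-memory sample are my own inventions for illustration, and it assumes <r2> elements never nest inside one another.

```python
import io
from xml.dom import pulldom


def iter_records_pulldom(source, start, count, tag="r2"):
    """Yield records start .. start+count-1 (0-based) as XML strings.

    Hypothetical helper: assumes <tag> elements are never nested.
    """
    events = pulldom.parse(source)      # accepts a filename or a file object
    seen = 0
    for event, node in events:
        if event != pulldom.START_ELEMENT or node.tagName != tag:
            continue
        if seen >= start:
            events.expandNode(node)     # build a DOM for this one record only
            yield node.toxml()
        seen += 1
        if seen >= start + count:
            return                      # stop reading once the range is done


# Small in-memory stand-in for the real file (illustration only).
pulldom_sample = io.BytesIO(
    b"<r1>"
    b"<r2><r3><r4>1</r4></r3></r2>"
    b"<r2><r3><r4>2</r4></r3></r2>"
    b"<r2><r3><r4>3</r4></r3></r2>"
    b"<r2><r3><r4>4</r4></r3></r2>"
    b"</r1>"
)
pulldom_picked = list(iter_records_pulldom(pulldom_sample, start=1, count=2))
```

Records before the wanted range are never expanded, so only the records you actually extract get built into DOM nodes.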

ElementTree.iterparse *might* help.
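
A rough sketch of the iterparse approach, under the same caveats: the helper name and its 0-based `start`/`count` arguments are hypothetical, a record is taken to be an <r2> that is a direct child of the root, and I have not tried it on a file of that size.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, start, count, tag="r2"):
    """Yield records start .. start+count-1 (0-based) as XML strings.

    Hypothetical helper: a record is a <tag> directly under the root.
    """
    depth = 0
    seen = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem             # the first start event is the root
            depth += 1
            continue
        depth -= 1
        if depth != 1 or elem.tag != tag:
            continue                    # not a complete top-level record
        if seen >= start:
            elem.tail = None            # drop whitespace between records
            yield ET.tostring(elem, encoding="unicode")
        seen += 1
        root.clear()                    # discard records already handled
        if seen >= start + count:
            return                      # no need to read the rest


# Small in-memory stand-in for the real file (illustration only).
sample = io.BytesIO(
    b"<r1>\n"
    b"  <r2><r3><r4>1</r4></r3></r2>\n"
    b"  <r2><r3><r4>2</r4></r3></r2>\n"
    b"  <r2><r3><r4>3</r4></r3></r2>\n"
    b"  <r2><r3><r4>4</r4></r3></r2>\n"
    b"</r1>\n"
)
picked = list(iter_records(sample, start=1, count=2))
```

For the real job you would pass an open file instead of the BytesIO and write each yielded string to the output file, wrapped in a root element of your choosing. Clearing the root after each record is what should keep memory use flat; without it every record parsed so far stays attached to the tree.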

regards
  Steve

-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/



