XML parsing per record

Wed Apr 20 08:03:00 EDT 2005

Willem Ligtenberg wrote:
>>Willem Ligtenberg <WLigtenberg at gmail.com> wrote:
>>
>>>I want to parse a very large (2.4 GB) XML file (bioinformatics,
>>>of course :)) but I have no clue how to do that. Most things I see read
>>>the entire XML file at once. That isn't going to work here, of course.
>>>
>>>So I would like to parse an XML file one record at a time and then be
>>>able to store the information in another object.  How should I do
>>>that?
> 
> The XML file I need to parse contains information about genes.
> So the first element is a gene and then there are a lot of sub-elements with
> sub-elements. I only need some of the information and want to store it in
> an object called gene. Later on this information will be printed into a
> file, which in its turn will be fed into some other program.
> This is an example of the XML:
> <?xml version="1.0"?>
> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
> <Entrezgene-Set>
>   <Entrezgene>
>     <snip>
>   </Entrezgene>
> </Entrezgene-Set>

This should get you started with cElementTree:

import cElementTree as ElementTree

source = 'Entrezgene.xml'

for event, elem in ElementTree.iterparse(source):
    if elem.tag == 'Entrezgene':
        # Process the Entrezgene element
        geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
        print 'Gene id', geneid

        # Throw away the element, we're done with it
        elem.clear()
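
For readers on Python 3, where ElementTree ships in the standard library as
xml.etree.ElementTree, the same per-record approach might look roughly like
the sketch below. It is not part of the original reply. The function name
iter_gene_ids and the SAMPLE data are made up for illustration; the tag names
are the ones used above. One assumed refinement: elem.clear() empties a record
but leaves the emptied element attached to the root, so for a file with very
many records this version clears the root instead.

```python
import io
import xml.etree.ElementTree as ET

# Path to the gene id inside one record, taken from the example above.
GENEID_PATH = 'Entrezgene_track-info/Gene-track/Gene-track_geneid'


def iter_gene_ids(source, record_tag='Entrezgene'):
    """Yield the gene id of each record, parsing one record at a time.

    `source` may be a filename or a binary file object.
    (iter_gene_ids is a hypothetical helper, not an ElementTree API.)
    """
    context = ET.iterparse(source, events=('start', 'end'))
    # The first event is the start of the root element; keep a handle on it.
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == record_tag:
            yield elem.findtext(GENEID_PATH)
            # Detach finished records from the root so they can be freed.
            root.clear()


# Tiny made-up document with the same shape as the Entrezgene example.
SAMPLE = (
    b'<?xml version="1.0"?>'
    b'<Entrezgene-Set>'
    b'<Entrezgene><Entrezgene_track-info><Gene-track>'
    b'<Gene-track_geneid>1</Gene-track_geneid>'
    b'</Gene-track></Entrezgene_track-info></Entrezgene>'
    b'<Entrezgene><Entrezgene_track-info><Gene-track>'
    b'<Gene-track_geneid>2</Gene-track_geneid>'
    b'</Gene-track></Entrezgene_track-info></Entrezgene>'
    b'</Entrezgene-Set>'
)

if __name__ == '__main__':
    for geneid in iter_gene_ids(io.BytesIO(SAMPLE)):
        print('Gene id', geneid)
```

For the real file, pass the filename instead of the BytesIO object, e.g.
iter_gene_ids('Entrezgene.xml'), and build the gene object inside the loop
before the record is discarded.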

Kent