XML parsing per record

Thu Apr 21 09:17:48 EDT 2005

Sorry, I just decided that I want to use your solution, but I am wondering:
is cElementTree in expat, or is that something different?

On Wed, 20 Apr 2005 08:03:00 -0400, Kent Johnson wrote:

> Willem Ligtenberg wrote:
>>>Willem Ligtenberg <WLigtenberg at gmail.com> wrote:
>>>
>>>>I want to parse a very large (2.4 gig) XML file (bioinformatics,
>>>>of course :)) but I have no clue how to do that. Most things I see read
>>>>the entire XML file at once. That isn't going to work here, of course.
>>>>
>>>>So I would like to parse an XML file one record at a time and then be
>>>>able to store the information in another object.  How should I do
>>>>that?
>> 
>> The XML file I need to parse contains information about genes.
>> So the first element is a gene and then there are a lot of sub-elements with
>> sub-elements. I only need some of the information and want to store it in
>> an object called gene. Later on this information will be printed into a
>> file, which in its turn will be fed into some other program.
>> This is an example of the XML:
>> <?xml version="1.0"?>
>> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
>> <Entrezgene-Set>
>>   <Entrezgene>
>>     <snip>
>>   </Entrezgene>
>> </Entrezgene-Set>
> 
> This should get you started with cElementTree:
> 
> import cElementTree as ElementTree
> 
> source = 'Entrezgene.xml'
> 
> # iterparse reports 'end' events by default, so each element is complete here
> for event, elem in ElementTree.iterparse(source):
>     if elem.tag == 'Entrezgene':
>         # Process the Entrezgene element
>         geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
>         print 'Gene id', geneid
>
>         # Throw away the element, we're done with it
>         elem.clear()
> 
> Kent
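
For anyone trying the quoted approach on a current Python: a minimal sketch,
assuming Python 3, where the same iterparse API is available as
xml.etree.ElementTree in the standard library. SAMPLE, GENEID_PATH and
iter_gene_ids are made-up names for illustration, and the tiny in-memory
document only stands in for the real 2.4 gig file. It also clears the root
element after each record, which is a variation on the quoted code, not part
of it.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real file (assumption: structure follows the
# element path used in the quoted example).
SAMPLE = b"""<?xml version="1.0"?>
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>1</Gene-track_geneid>
      </Gene-track>
    </Entrezgene_track-info>
  </Entrezgene>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>2</Gene-track_geneid>
      </Gene-track>
    </Entrezgene_track-info>
  </Entrezgene>
</Entrezgene-Set>
"""

GENEID_PATH = 'Entrezgene_track-info/Gene-track/Gene-track_geneid'


def iter_gene_ids(source):
    """Yield the gene id of each Entrezgene record, one record at a time."""
    context = ET.iterparse(source, events=('start', 'end'))
    # The first event is the start of the root element; keep a reference to it.
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == 'Entrezgene':
            yield elem.findtext(GENEID_PATH)
            # Drop finished records from the root so they do not pile up.
            root.clear()


gene_ids = list(iter_gene_ids(io.BytesIO(SAMPLE)))
print('Gene ids', gene_ids)
```

For the real data, a file name such as 'Entrezgene.xml' can be passed instead
of the BytesIO object; iterparse accepts either a path or a file object.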