XML parsing per record

Thu Apr 21 08:55:31 EDT 2005

I'll first try it using SAX, because I want to have as few dependencies
as possible. I already have BioPython as a dependency, and I personally
don't like having to install lots of packages for a program to work, so I
don't want to impose that on other people.
But thanks anyway. I might go for cElementTree later on if ordinary SAX
proves too slow...
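
A rough sketch of what the SAX version might look like, using only the
standard library. The names GeneHandler, parse_genes and on_record are
made up for illustration, and I'm assuming the gene id sits in a
Gene-track_geneid element inside each Entrezgene record, as in Kent's
example. Only the current record is kept in memory, so the size of the
file should not matter much.

```python
import xml.sax
import xml.sax.handler


class GeneHandler(xml.sax.ContentHandler):
    """Build one small dict per <Entrezgene> record and hand it off."""

    RECORD_TAG = "Entrezgene"          # element that wraps one record
    GENEID_TAG = "Gene-track_geneid"   # assumed location of the gene id

    def __init__(self, on_record):
        super().__init__()
        self.on_record = on_record  # hypothetical callback, called once per record
        self.record = None          # dict for the record being read, or None
        self.text = []              # character data of the current element

    def startElement(self, name, attrs):
        if name == self.RECORD_TAG:
            self.record = {}
        self.text = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks
        self.text.append(content)

    def endElement(self, name):
        if self.record is None:
            return
        if name == self.GENEID_TAG:
            self.record["geneid"] = "".join(self.text).strip()
        elif name == self.RECORD_TAG:
            # Record complete: pass it on and drop our reference to it
            self.on_record(self.record)
            self.record = None
        self.text = []


def parse_genes(source, on_record):
    """Stream-parse `source` (filename or file object), one record at a time."""
    parser = xml.sax.make_parser()
    # Don't try to load external entities such as the DTD named in the DOCTYPE
    parser.setFeature(xml.sax.handler.feature_external_ges, False)
    parser.setContentHandler(GeneHandler(on_record))
    parser.parse(source)
```

Calling parse_genes('Entrezgene.xml', some_function) would then invoke
some_function once per gene with a dict like {'geneid': ...}; other fields
could be picked up the same way in endElement.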

On Wed, 20 Apr 2005 08:03:00 -0400,
Kent Johnson wrote:

> Willem Ligtenberg wrote:
>>>Willem Ligtenberg <WLigtenberg at gmail.com> wrote:
>>>
>>>>I want to parse a very large (2.4 gig) XML file (bioinformatics
>>>>of course :)) But I have no clue how to do that. Most things I see read
>>>>the entire XML file at once. That isn't going to work here, of course.
>>>>
>>>>So I would like to parse an XML file one record at a time and then be
>>>>able to store the information in another object.  How should I do
>>>>that?
>> 
>> The XML file I need to parse contains information about genes.
>> So the first element is a gene and then there are a lot of sub-elements
>> with sub-elements. I only need some of the information and want to store
>> it in an object called gene. Later on this information will be printed
>> into a file, which in its turn will be fed into some other program.
>> This is an example of the XML:
>> <?xml version="1.0"?>
>> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
>> <Entrezgene-Set>
>>   <Entrezgene>
>>     <snip>
>>   </Entrezgene>
>> </Entrezgene-Set>
> 
> This should get you started with cElementTree:
> 
> import cElementTree as ElementTree
> 
> source = 'Entrezgene.xml'
> 
> for event, elem in ElementTree.iterparse(source):
>      if elem.tag == 'Entrezgene':
>          # Process the Entrezgene element
>          geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
>          print 'Gene id', geneid
> 
>          # Throw away the element, we're done with it
>          elem.clear()
> 
> Kent