XML parsing per record
Willem Ligtenberg
WLigtenberg at gmail.com
Thu Apr 21 08:55:31 EDT 2005
I'll try SAX first, because I want as few dependencies as possible.
I already have BioPython as a dependency, and I personally don't like
having to install lots of packages just to get a program to work, so I
don't want to impose that on other people.
But thanks anyway; I might go for cElementTree later on if ordinary
SAX proves too slow...
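
A minimal sketch of what the per-record SAX approach could look like, assuming Python 3 and only the standard library's xml.sax module. The names GeneHandler, parse_genes and on_gene are hypothetical, and the only field pulled out is the Gene-track_geneid element that appears in the quoted example below; a real handler would collect whichever sub-elements are needed.

```python
import io
import xml.sax


class GeneHandler(xml.sax.ContentHandler):
    """Build one small dict per <Entrezgene> record and pass it to a callback."""

    def __init__(self, on_gene):
        super().__init__()
        self.on_gene = on_gene  # hypothetical callback, called once per record
        self.record = None      # dict for the record currently being read
        self.text = []          # character data seen since the last tag

    def startElement(self, name, attrs):
        if name == 'Entrezgene':
            self.record = {}
        self.text = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        self.text.append(content)

    def endElement(self, name):
        if self.record is None:
            return  # outside any <Entrezgene> record
        if name == 'Gene-track_geneid':
            self.record['geneid'] = ''.join(self.text).strip()
        elif name == 'Entrezgene':
            # The record is complete: hand it over and forget it.
            self.on_gene(self.record)
            self.record = None
        self.text = []


def parse_genes(source, on_gene):
    """Stream `source` (a filename or binary file object) through the handler."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(GeneHandler(on_gene))
    parser.parse(source)
```

Because the handler keeps only the record in progress, memory use should not grow with the size of the file. Calling parse_genes('Entrezgene.xml', print) would print one dict per gene.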
On Wed, 20 Apr 2005 08:03:00 -0400,
Kent Johnson wrote:
> Willem Ligtenberg wrote:
>>>Willem Ligtenberg <WLigtenberg at gmail.com> wrote:
>>>
>>>>I want to parse a very large (2.4 GB) XML file (bioinformatics,
>>>>of course :)) but I have no clue how to do that. Most things I see read
>>>>the entire XML file at once. That isn't going to work here, of course.
>>>>
>>>>So I would like to parse an XML file one record at a time and then be
>>>>able to store the information in another object. How should I do
>>>>that?
>>
>> The XML file I need to parse contains information about genes.
>> So the first element is a gene, and then there are a lot of sub-elements
>> with sub-elements. I only need some of the information and want to store
>> it in an object called gene. Later on this information will be printed
>> into a file, which in its turn will be fed into some other program.
>> This is an example of the XML:
>> <?xml version="1.0"?>
>> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
>> <Entrezgene-Set>
>> <Entrezgene>
>> <snip>
>> </Entrezgene>
>> </Entrezgene-Set>
>
> This should get you started with cElementTree:
>
> import cElementTree as ElementTree
>
> source = 'Entrezgene.xml'
>
> for event, elem in ElementTree.iterparse(source):
>     if elem.tag == 'Entrezgene':
>         # Process the Entrezgene element
>         geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
>         print 'Gene id', geneid
>
>         # Throw away the element, we're done with it
>         elem.clear()
>
> Kent
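
For comparison, a hedged sketch of the same iterparse loop using only the standard library, assuming Python 3, where xml.etree.ElementTree provides iterparse and no separate cElementTree install is needed. The function name iter_gene_ids is hypothetical. One difference from the snippet above: elem.clear() empties each record but leaves the empty element attached to the root, so this version clears the root instead, which may matter on a 2.4 GB file.

```python
import io
import xml.etree.ElementTree as ET


def iter_gene_ids(source):
    """Yield the gene id of each <Entrezgene> record, one record at a time."""
    # Request 'start' events as well, so the root element is available early.
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == 'Entrezgene':
            yield elem.findtext(
                'Entrezgene_track-info/Gene-track/Gene-track_geneid')
            # Drop the finished record from the root so the tree stays small.
            root.clear()
```

Used as `for geneid in iter_gene_ids('Entrezgene.xml'): ...`, this processes one gene at a time without holding the whole document in memory.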