XML parsing per record
Willem Ligtenberg
WLigtenberg at gmail.com
Thu Apr 21 09:17:48 EDT 2005
Sorry, I just decided that I want to use your solution, but I am wondering:
is cElementTree in expat, or is that something different?
On Wed, 20 Apr 2005 08:03:00 -0400, Kent Johnson wrote:
> Willem Ligtenberg wrote:
>>>Willem Ligtenberg <WLigtenberg at gmail.com> wrote:
>>>
>>>>I want to parse a very large (2.4 gig) XML file (bioinformatics
>>>>of course :)) But I have no clue how to do that. Most things I see read
>>>>the entire XML file at once. That isn't going to work here, of course.
>>>>
>>>>So I would like to parse an XML file one record at a time and then be
>>>>able to store the information in another object. How should I do
>>>>that?
>>
>> The XML file I need to parse contains information about genes.
>> So the first element is a gene and then there are a lot of sub-elements with
>> sub-elements. I only need some of the information and want to store it in
>> an object called gene. Later on this information will be printed into a
>> file, which in its turn will be fed into some other program.
>> This is an example of the XML:
>> <?xml version="1.0"?>
>> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
>> <Entrezgene-Set>
>> <Entrezgene>
>> <snip>
>> </Entrezgene>
>> </Entrezgene-Set>
>
> This should get you started with cElementTree:
>
> import cElementTree as ElementTree
>
> source = 'Entrezgene.xml'
>
> for event, elem in ElementTree.iterparse(source):
>     if elem.tag == 'Entrezgene':
>         # Process the Entrezgene element
>         geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
>         print 'Gene id', geneid
>
>         # Throw away the element, we're done with it
>         elem.clear()
>
> Kent
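
For anyone trying this on a current Python, the same record-at-a-time idea can be sketched with the standard library's xml.etree.ElementTree, which also provides iterparse. This is only a minimal sketch under assumptions: the two-record sample data and the names SAMPLE, GENEID_PATH and iter_gene_ids are made up for illustration, since the real content behind <snip> is not shown in this thread.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-record sample; the real records contain far more
# sub-elements than shown here.
SAMPLE = b"""<?xml version="1.0"?>
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>1</Gene-track_geneid>
      </Gene-track>
    </Entrezgene_track-info>
  </Entrezgene>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>2</Gene-track_geneid>
      </Gene-track>
    </Entrezgene_track-info>
  </Entrezgene>
</Entrezgene-Set>
"""

# Path taken from the quoted example above.
GENEID_PATH = "Entrezgene_track-info/Gene-track/Gene-track_geneid"


def iter_gene_ids(source):
    """Yield the gene id of each Entrezgene record, one record at a time."""
    # "end" events fire once an element and all of its children are parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Entrezgene":
            yield elem.findtext(GENEID_PATH)
            # Drop the record's children and text so memory can be reclaimed.
            elem.clear()


if __name__ == "__main__":
    # A filename such as 'Entrezgene.xml' could be passed instead of BytesIO.
    for geneid in iter_gene_ids(io.BytesIO(SAMPLE)):
        print("Gene id", geneid)
```

Writing it as a generator keeps only one record in memory at a time while the caller decides what to store. The cleared (now empty) Entrezgene elements stay attached to the root element, so on a very large file memory use can still grow slowly.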