XML parsing per record
Kent Johnson
kent37 at tds.net
Wed Apr 20 08:03:00 EDT 2005
Willem Ligtenberg wrote:
>>Willem Ligtenberg <WLigtenberg at gmail.com> wrote:
>>
>>>I want to parse a very large (2.4 GB) XML file (bioinformatics,
>>>of course :)), but I have no clue how to do that. Most things I see read
>>>the entire XML file at once. That isn't going to work here, of course.
>>>
>>>So I would like to parse an XML file one record at a time and then be
>>>able to store the information in another object. How should I do
>>>that?
>
> The XML file I need to parse contains information about genes.
> So the first element is a gene, and then there are a lot of sub-elements with
> sub-elements. I only need some of the information and want to store it in
> an object called gene. Later on this information will be printed into a
> file, which in its turn will be fed into some other program.
> This is an example of the XML:
> <?xml version="1.0"?>
> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
> <Entrezgene-Set>
> <Entrezgene>
> <snip>
> </Entrezgene>
> </Entrezgene-Set>
This should get you started with cElementTree:

    import cElementTree as ElementTree

    source = 'Entrezgene.xml'

    for event, elem in ElementTree.iterparse(source):
        if elem.tag == 'Entrezgene':
            # Process the Entrezgene element
            geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
            print 'Gene id', geneid

            # Throw away the element, we're done with it
            elem.clear()

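For readers on Python 3, here is a rough sketch of the same idea using the standard library's xml.etree.ElementTree, which has absorbed cElementTree. The helper name iter_gene_ids and the tiny SAMPLE document are made up for illustration; only the tag names and the findtext path come from the snippet above.

```python
import io
import xml.etree.ElementTree as ET

# Path copied from the snippet above.
GENEID_PATH = 'Entrezgene_track-info/Gene-track/Gene-track_geneid'


def iter_gene_ids(source):
    """Yield the gene id of each Entrezgene record, one record at a time.

    `source` may be a filename or a binary file object.
    (Hypothetical helper, not part of any library.)
    """
    # iterparse reports only 'end' events by default, so every element we
    # see is complete, including all of its children.
    for event, elem in ET.iterparse(source):
        if elem.tag == 'Entrezgene':
            yield elem.findtext(GENEID_PATH)
            # Throw away the record's children and text; we're done with it.
            elem.clear()


# Made-up two-record sample standing in for the real 2.4 GB file.
SAMPLE = b"""<?xml version="1.0"?>
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>1</Gene-track_geneid>
      </Gene-track>
    </Entrezgene_track-info>
  </Entrezgene>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>2</Gene-track_geneid>
      </Gene-track>
    </Entrezgene_track-info>
  </Entrezgene>
</Entrezgene-Set>
"""

for geneid in iter_gene_ids(io.BytesIO(SAMPLE)):
    print('Gene id', geneid)
```

Note that elem.clear() empties each record but leaves the (now empty) Entrezgene element attached to the root, so for a file with a very large number of records it may be worth clearing the root element periodically as well.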
Kent