[Tutor] how to extract text by specifying an element using ElementTree

Wed Dec 21 06:05:40 CET 2005

Dear drs. Yoo and johnson, 
Thank you very much for your help. I successully
parsed my GO annotation from all 16,000 files. 
thanks again for your kind help

--- Danny Yoo <dyoo at hkn.eecs.berkeley.edu> wrote:

> 
> > >>> for m in mydata.findall('//functions'):
> > 	print m.get('molecular_class').text
> >
> > >>> for m in mydata.findall('//functions'):
> > 	print m.find('molecular_class').text.strip()
> >
> > >>> for process in
> > mydata.findall('//biological_process'):
> > 	print process.get('title').text
> 
> 
> Hello,
> 
> I believe we're running into XML namespace issues. 
> If we look at all the
> tag names in the XML, we can see this:
> 
> ######
> >>> from elementtree import ElementTree
> >>> tree = ElementTree.parse(open('00004.xml'))
> >>> for element in tree.getroot()[0]: print
> element.tag
> ...
> {org:hprd:dtd:hprdr2}title
> {org:hprd:dtd:hprdr2}alt_title
> {org:hprd:dtd:hprdr2}alt_title
> {org:hprd:dtd:hprdr2}alt_title
> {org:hprd:dtd:hprdr2}alt_title
> {org:hprd:dtd:hprdr2}alt_title
> {org:hprd:dtd:hprdr2}omim
> {org:hprd:dtd:hprdr2}gene_symbol
> {org:hprd:dtd:hprdr2}gene_map_locus
> {org:hprd:dtd:hprdr2}seq_entry
> {org:hprd:dtd:hprdr2}molecular_weight
> {org:hprd:dtd:hprdr2}entry_sequence
> {org:hprd:dtd:hprdr2}protein_domain_architecture
> {org:hprd:dtd:hprdr2}expressions
> {org:hprd:dtd:hprdr2}functions
> {org:hprd:dtd:hprdr2}cellular_component
> {org:hprd:dtd:hprdr2}interactions
> {org:hprd:dtd:hprdr2}EXTERNAL_LINKS
> {org:hprd:dtd:hprdr2}author
> {org:hprd:dtd:hprdr2}last_updated
> ######
> 
> (I'm just doing a quick view of the toplevel
> elements in the tree.)
> 
> As we can see, each element's tag is being prefixed
> with the namespace URL
> provided in the XML document.  If we look in our XML
> document and search
> for the attribute 'xmlns', we'll see where this
> 'org:hprd:dtd:hprdr2'
> thing comes from.
> 
> 
> So we may need to prepend the namespace to get the
> proper terms:
> 
> ######
> >>> for process in
>
tree.find("//{org:hprd:dtd:hprdr2}biological_processes"):
> ...     print
> process.findtext("{org:hprd:dtd:hprdr2}title")
> ...
> Metabolism
> Energy pathways
> ######
> 
> 
> To tell the truth, I don't quite understand how to
> work fluently with XML
> namespaces, so perhaps there's an easier way to do
> what you want.  But the
> examples above should help you get started parsing
> all your Gene Ontology
> annotations.
> 
> 
> 
> Good luck!
> 
> 

Send instant messages to your online friends http://in.messenger.yahoo.com