[Tutor] Parsing a block of XML text
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Sat Jan 1 06:44:37 CET 2005
On Fri, 31 Dec 2004, kumar s wrote:
> I am trying to parse BLAST output (Basic Local Alignment Search Tool,
> size around more than 250 KB ).
[xml text cut]
Hi Kumar,
Just as a side note: have you looked at Biopython yet?
http://biopython.org/
I mention this because Biopython comes with parsers for BLAST; it's
possible that you may not even need to touch XML parsing if the BLAST
parsers in Biopython are sufficiently good. Other people have already
solved the parsing problem for BLAST: you may be able to take advantage of
that work.
> I wanted to parse out :
>
> <Hsp_query-from> <Hsp_query-out)
> <Hsp_hit-from></Hsp_hit-from>
> <Hsp_hit-to></Hsp_hit-to>
Ok, I see that you are trying to get the content of the High Scoring Pair
(HSP) query and hit coordinates.
> I wrote a ver small 4 line code to obtain it.
>
> for bls in doc.getElementsByTagName('Hsp_num'):
>     bls.normalize()
>     if bls.firstChild.data >1:
>         print bls.firstChild.data
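This won't do what you want. 'bls.firstChild.data' is a string, not a
number, so the expression:
    bls.firstChild.data > 1
compares a string against an integer rather than comparing two numbers.
You'd need to convert the text with int() before comparing. Also, here,
try using this function to get the text out of an element: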
###
def get_text(node):
    """Returns the child text contents of the node."""
    buffer = []
    for c in node.childNodes:
        if c.nodeType == c.TEXT_NODE:
            buffer.append(c.data)
    return ''.join(buffer)
###
(code adapted from: http://www.python.org/doc/lib/dom-example.html)
For example:
###
>>> import xml.dom.minidom
>>> doc = xml.dom.minidom.parseString("<a><b>hello</b><b>world</b></a>")
>>> for bnode in doc.getElementsByTagName('b'):
...     print("I see: " + get_text(bnode))
...
I see: hello
I see: world
###
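Coming back to the comparison in your loop: here is a minimal sketch of
one way to fix it, assuming that each 'Hsp_num' element holds plain
integer text. The sample document and the hsp_nums_above() name are made
up for illustration, and the code is written for Python 3.

```python
import xml.dom.minidom

def get_text(node):
    """Returns the child text contents of the node."""
    return ''.join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)

# Made-up sample data, only for illustration; not real BLAST output.
SAMPLE = ("<Hit_hsps>"
          "<Hsp><Hsp_num>1</Hsp_num></Hsp>"
          "<Hsp><Hsp_num>2</Hsp_num></Hsp>"
          "<Hsp><Hsp_num>3</Hsp_num></Hsp>"
          "</Hit_hsps>")

def hsp_nums_above(doc, threshold):
    """Returns the Hsp_num values in doc that are greater than threshold."""
    nums = []
    for node in doc.getElementsByTagName('Hsp_num'):
        num = int(get_text(node))   # convert the text to a number first
        if num > threshold:
            nums.append(num)
    return nums

doc = xml.dom.minidom.parseString(SAMPLE)
print(hsp_nums_above(doc, 1))
```

The important part is the int() call: once the text is a number, the
comparison against the threshold is an ordinary numeric comparison.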
> Could any one help me directing how to get the elements in that tag.
One way to approach structured parsing problems systematically is to write
a function for each particular element type that you're trying to parse.
From the sample XML that you've shown us, it appears that your document
consists of a single 'Hit' root node. Each 'Hit' appears to have a
'Hit_hsps' element. A 'Hit_hsps' element can have several 'Hsp's
associated with it. And an 'Hsp' element contains those coordinates that
you are interested in.
More formally, we can structure our parsing code to match the structure
of the data:
### pseudocode ###
def parse_Hit(node):
    ## get at the Hit_hsps element, and call parse_Hit_hsps() on it.

def parse_Hit_hsps(node):
    ## get all of the Hsp elements, and call parse_Hsp() on each one of
    ## them.

def parse_Hsp(node):
    ## extract the query and hit coordinates out of the node.
######
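To make the outline a bit more concrete, here is a rough sketch of how
those three functions might be filled in with xml.dom.minidom. It is only
a sketch under some assumptions: the sample XML is made up to match the
shape described above, the query end tag is assumed to be 'Hsp_query-to',
the coordinates are assumed to be integers, and direct_children() is a
hypothetical helper, not part of minidom. The code is written for
Python 3.

```python
import xml.dom.minidom

# Made-up sample in the shape described above; not real BLAST output.
SAMPLE = """<Hit>
  <Hit_hsps>
    <Hsp>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>50</Hsp_query-to>
      <Hsp_hit-from>101</Hsp_hit-from>
      <Hsp_hit-to>150</Hsp_hit-to>
    </Hsp>
    <Hsp>
      <Hsp_query-from>60</Hsp_query-from>
      <Hsp_query-to>90</Hsp_query-to>
      <Hsp_hit-from>200</Hsp_hit-from>
      <Hsp_hit-to>230</Hsp_hit-to>
    </Hsp>
  </Hit_hsps>
</Hit>"""

# Assumed tag names for the four coordinates.
COORD_TAGS = ('Hsp_query-from', 'Hsp_query-to', 'Hsp_hit-from', 'Hsp_hit-to')

def get_text(node):
    """Returns the child text contents of the node."""
    return ''.join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)

def direct_children(node, tag):
    """Hypothetical helper: the element children of node named tag."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == tag]

def parse_Hit(node):
    """Returns a list of coordinate dicts, one per Hsp under the Hit."""
    results = []
    for hsps in direct_children(node, 'Hit_hsps'):
        results.extend(parse_Hit_hsps(hsps))
    return results

def parse_Hit_hsps(node):
    """Calls parse_Hsp() on each Hsp child of a Hit_hsps element."""
    return [parse_Hsp(hsp) for hsp in direct_children(node, 'Hsp')]

def parse_Hsp(node):
    """Extracts the query and hit coordinates out of an Hsp element."""
    coords = {}
    for tag in COORD_TAGS:
        coords[tag] = int(get_text(direct_children(node, tag)[0]))
    return coords

doc = xml.dom.minidom.parseString(SAMPLE)
for coords in parse_Hit(doc.documentElement):
    print(coords)
```

Each function only looks at the direct children it is responsible for, so
the code stays close to the shape of the document: if the XML grows
another layer, you add another small function rather than reworking one
big loop.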
To see another example of this kind of program structure, see:
http://www.python.org/doc/lib/dom-example.html
Please feel free to ask more questions. Good luck to you.