[Tutor] Parsing a block of XML text

Sat Jan 1 06:44:37 CET 2005

On Fri, 31 Dec 2004, kumar s wrote:

> I am trying to parse BLAST output (Basic Local Alignment Search Tool,
> size around more than 250 KB ).

[xml text cut]

Hi Kumar,

Just as a side note: have you looked at Biopython yet?

    http://biopython.org/

I mention this because Biopython comes with parsers for BLAST; it's
possible that you may not even need to touch XML parsing if the BLAST
parsers in Biopython are sufficiently good.  Other people have already
solved the parsing problem for BLAST: you may be able to take advantage of
that work.

> I wanted to parse out :
>
> <Hsp_query-from> <Hsp_query-out)
>  <Hsp_hit-from></Hsp_hit-from>
>   <Hsp_hit-to></Hsp_hit-to>

Ok, I see that you are trying to get the content of the High Scoring Pair
(HSP) query and hit coordinates.

> I wrote a ver small 4 line code to obtain it.
>
> for bls in doc.getElementsByTagName('Hsp_num'):
> 	bls.normalize()
> 	if bls.firstChild.data >1:
> 		print bls.firstChild.data

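This won't do what you expect.  'bls.firstChild.data' is a string, not a
number, so the expression:

    bls.firstChild.data > 1

compares a string against an integer rather than comparing two numeric
values.  Convert the text with int() before you compare it.  To get the
text out of an element in the first place, try using this function: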

###
def get_text(node):
    """Returns the child text contents of the node."""
    buffer = []
    for c in node.childNodes:
        if c.nodeType == c.TEXT_NODE:
            buffer.append(c.data)
    return ''.join(buffer)
###

(code adapted from: http://www.python.org/doc/lib/dom-example.html)

For example:

###
>>> import xml.dom.minidom
>>> doc = xml.dom.minidom.parseString("<a><b>hello</b><b>world</b></a>")
>>> for bnode in doc.getElementsByTagName('b'):
...     print "I see:", get_text(bnode)
...
I see: hello
I see: world
###
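
Putting those two points together, your original loop could be rewritten
roughly like this.  This is only a sketch: the sample XML is made up (I
don't have your full BLAST output), and hsp_nums_above() is a name I
invented for illustration.

```python
import xml.dom.minidom

def get_text(node):
    """Returns the child text contents of the node."""
    return ''.join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)

# Made-up sample data; the real BLAST output was not shown in full.
SAMPLE = ("<Hit_hsps>"
          "<Hsp><Hsp_num>1</Hsp_num></Hsp>"
          "<Hsp><Hsp_num>2</Hsp_num></Hsp>"
          "<Hsp><Hsp_num>10</Hsp_num></Hsp>"
          "</Hit_hsps>")

def hsp_nums_above(doc, threshold):
    """Returns the Hsp_num values, as integers, that exceed threshold."""
    nums = []
    for node in doc.getElementsByTagName('Hsp_num'):
        value = int(get_text(node))   # convert the text to a number first
        if value > threshold:
            nums.append(value)
    return nums

doc = xml.dom.minidom.parseString(SAMPLE)
print(hsp_nums_above(doc, 1))
```

Because the values are converted with int(), 10 is treated as larger than
2, which would not be the case if the strings "10" and "2" were compared.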

> Could any one help me directing how to get the elements in that tag.

One way to approach structured parsing problems systematically is to write
a function for each particular element type that you're trying to parse.

From the sample XML that you've shown us, it appears that your document
consists of a single 'Hit' root node.  Each 'Hit' appears to have a
'Hit_hsps' element.  A 'Hit_hsps' element can have several 'Hsp's
associated with it.  And an 'Hsp' element contains those coordinates that
you are interested in.

More formally, we can structure our parsing code to match the structure
of the data:

### pseudocode ###
def parse_Hit(node):
    ## get at the Hit_hsps element, and call parse_Hit_hsps() on it.

def parse_Hit_hsps(node):
    ## get all of the Hsp elements, and call parse_Hsp() on each one of
    ## them.

def parse_Hsp(node):
    ## extract the query and hit coordinates out of the node.
######
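
Here is one way that outline might be filled in with minidom.  Treat it as
a sketch under a few assumptions: the sample document is made up, the
helper first_child_element() is something I invented for this example, the
second query coordinate is assumed to live in an 'Hsp_query-to' element
(the name NCBI's BLAST XML output uses), and every element the code looks
for is assumed to be present.

```python
import xml.dom.minidom

def get_text(node):
    """Returns the child text contents of the node."""
    return ''.join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)

def first_child_element(node, name):
    """Hypothetical helper: first direct child element called name, or None."""
    for c in node.childNodes:
        if c.nodeType == c.ELEMENT_NODE and c.tagName == name:
            return c
    return None

def parse_Hit(node):
    """Returns a list of coordinate dictionaries, one per Hsp in the Hit."""
    return parse_Hit_hsps(first_child_element(node, 'Hit_hsps'))

def parse_Hit_hsps(node):
    """Calls parse_Hsp() on each Hsp child and collects the results."""
    return [parse_Hsp(c) for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == 'Hsp']

def parse_Hsp(node):
    """Extracts the query and hit coordinates, as integers, from an Hsp."""
    coords = {}
    for name in ('Hsp_query-from', 'Hsp_query-to',
                 'Hsp_hit-from', 'Hsp_hit-to'):
        coords[name] = int(get_text(first_child_element(node, name)))
    return coords

# Made-up sample data that follows the structure described above.
SAMPLE = """<Hit>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>25</Hsp_query-to>
      <Hsp_hit-from>100</Hsp_hit-from>
      <Hsp_hit-to>124</Hsp_hit-to>
    </Hsp>
    <Hsp>
      <Hsp_num>2</Hsp_num>
      <Hsp_query-from>30</Hsp_query-from>
      <Hsp_query-to>60</Hsp_query-to>
      <Hsp_hit-from>200</Hsp_hit-from>
      <Hsp_hit-to>230</Hsp_hit-to>
    </Hsp>
  </Hit_hsps>
</Hit>
"""

doc = xml.dom.minidom.parseString(SAMPLE)
hsps = parse_Hit(doc.documentElement)
for hsp in hsps:
    print(hsp)
```

Each function only knows about one kind of element, so if the real
document turns out to have more layers above 'Hit', you can add another
function on top without touching these three.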

To see another example of this kind of program structure, see:

    http://www.python.org/doc/lib/dom-example.html

Please feel free to ask more questions.  Good luck to you.