[XML-SIG] dom building, sax, and namespaces

Sylvain Thenault Sylvain.Thenault@logilab.fr
Fri, 25 Jan 2002 12:42:31 +0100 (CET)


On Wed, 23 Jan 2002, Andrew Dalke wrote:

> Sylvain Thenault:
> > you shouldn't have to use an adapter. The example in the previous mail you
> > posted works when you use XMLGenerator instead of pulldom.
> 
> Please correct me if I'm wrong.  Doesn't XMLGenerator convert the SAX
> events to a text stream?  So if I want to create a DOM I would need to
> reparse that stream?  If so, I would rather not have the intermediate
> data structure.

No, XMLGenerator produces a DOM tree using the 4DOM implementation. I have
added an "implementation" parameter to the constructor in case you want to
use a different implementation than 4DOM, but I don't know whether it's in
PyXML 0.7 or still only in CVS.
BTW, which version of PyXML and/or 4Suite are you using?
  
> > > (BTW, is there no built-in function to get the concatenation of all
> > > the text nodes, like my get_text function, below?)
> > 
> > I don't know if there is one in 4Suite or PyXML, but I think this is a job
> > for XSLT)
> 
> I thought it was one for XPath, expecting I could use .text() to get
> that data.  However, there appears to be a problem with that.
> >>> xml.xpath.Compile('spam[@eggs="yes"].text()')
[snip traceback]
                                           
Your XPath expression isn't valid; it should be 'spam[@eggs="yes"]/text()'.
Moreover, this won't return the concatenation of all the text nodes but a
_node-set_ containing all the matching nodes.
XPath provides a 'string' function, but 'string(spam[@eggs="yes"]/text())'
returns only the text value of the _first_ text node (in document order)
of the node-set.
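
To illustrate the difference, here is a minimal sketch that uses the
standard xml.dom.minidom module instead of an XPath engine. The sample
document and the helper names (matching_spam, text_children, get_text) are
made up for the example; get_text is only my guess at what such a helper
looks like, not your actual function.

```python
from xml.dom import minidom, Node

# Made-up sample document, for illustration only.
DOC = ('<root><spam eggs="yes">Hello <b>big</b> world</spam>'
       '<spam eggs="no">ignored</spam></root>')


def matching_spam(context):
    """Child elements selected by spam[@eggs="yes"] relative to context."""
    return [n for n in context.childNodes
            if n.nodeType == Node.ELEMENT_NODE
            and n.tagName == "spam"
            and n.getAttribute("eggs") == "yes"]


def text_children(element):
    """The node-set that text() selects: direct text-node children only."""
    return [n for n in element.childNodes if n.nodeType == Node.TEXT_NODE]


def get_text(node):
    """Concatenate every text node found at or below node."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    return "".join(get_text(child) for child in node.childNodes)


root = minidom.parseString(DOC).documentElement

# Equivalent of spam[@eggs="yes"]/text(): a list of nodes, not a string.
node_set = [t for e in matching_spam(root) for t in text_children(e)]

# Equivalent of string(...): only the first node of the node-set is used.
first_only = node_set[0].data if node_set else ""

# What you actually asked for: all the text below the matching elements.
full_text = "".join(get_text(e) for e in matching_spam(root))
```

Here node_set holds two text nodes, first_only is "Hello " and full_text is
"Hello big world".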

I recommend reading the W3C XPath Recommendation:
http://www.w3.org/TR/xpath 

> BTW, one problem I have with all of this is performance.  All I
> want is to get the text inside a region which matches a given XPath
> query.  Matching something like '//bioformat:dbid[@type="primary"]' is
> 25 times faster in SAX than DOM, except of course that the SAX code
> I wrote is only limited to single node evaluations.   (Don't get me
> wrong - given all that's going on with DOM, 25x ain't shabby!)

How do you apply an XPath expression to SAX events?
DOM is a higher-level API than SAX (a DOM tree may itself be built from SAX
events, as in your application), so it's expected that an application which
works on DOM trees is slower and consumes more memory than one which works
directly on the SAX events without first building a representation of the
full XML tree.
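
For one fixed pattern such as //bioformat:dbid[@type="primary"], I suppose
a hand-written handler can do the matching directly on the events. Here is
a minimal sketch of that idea with the standard xml.sax module; it is not
your code. The namespace URI, the class and function names and the sample
record are assumptions made for the example.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# Hypothetical namespace URI; the real bioformat URI is not given here.
BIOFORMAT_NS = "http://example.org/bioformat"


class PrimaryDbidCollector(ContentHandler):
    """Collect the text of every bioformat:dbid with type="primary"."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._depth = 0      # > 0 while we are inside a matching element
        self._buffer = []

    def startElementNS(self, name, qname, attrs):
        if self._depth:
            # Nested element inside a match: keep track of the nesting.
            self._depth += 1
        elif (name == (BIOFORMAT_NS, "dbid")
              and attrs.get((None, "type")) == "primary"):
            self._depth = 1
            self._buffer = []

    def characters(self, content):
        if self._depth:
            self._buffer.append(content)

    def endElementNS(self, name, qname):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.results.append("".join(self._buffer))


def collect_primary_ids(data):
    """Parse XML bytes and return the matched texts; no tree is built."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = PrimaryDbidCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.results


# Made-up sample record, for illustration only.
SAMPLE = (b'<bioformat:record xmlns:bioformat="http://example.org/bioformat">'
          b'<bioformat:dbid type="primary">P12345</bioformat:dbid>'
          b'<bioformat:dbid type="accession">P8392</bioformat:dbid>'
          b'</bioformat:record>')

primary_ids = collect_primary_ids(SAMPLE)
```

The handler keeps only a depth counter and a text buffer, which is why this
kind of code needs so little memory compared to a DOM. The price is that
each supported pattern has to be coded by hand.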
 
> As I said, I have a way to do XML markup of existing flat-files.  What
> I'm writing is a way to index a flat-file.  I want to use XPath
> queries to define which fields should be indexed, as in
> 
>   mindy_index --id id="//bioformat:dbid[@type='primary']" \
>               --alias accession="//bioformat:dbid[@type='accession'] \
>               --alias author="//bioformat::author \
>               --record-tag bioformat:record \
>                ....
>               list of filenames
> 
> then be able to do a search
> 
>   mindy_search --accession=P8392
> or
>   mindy_search --author="Andrew Dalke"
> 
> and retrieve the original records.  In some sense this would be like a
> specialized way to do fast queries of the form
> 
>   (text of the) nodes named 'bioformat:record' which have a
>      bioformat::dbid[@type='accession'] descendent equal to "P8392"
> 
> Processing a file takes about 20 minutes.  Eight hours (25x more)
> would be rather too long.  So I might have some code to figure out a
> subset of XPath that I can handle in a specialized SAX handler.

Isn't 20 minutes a long time to process one file?
Maybe you should consider using a database which can be queried with
XPath? (A database seems better suited to your amount of data.)
 
> BTW, how do I  get that bioformat:record node list as an XPath
> expression?  I've stared at the spec and Kay's "XSLT Programmer's
> reference" and I can't figure it out.  I did find out that this
> 
>    //bioformat:record//bioformat:dbid[@type='primary']
> 
> takes a really long time to run -- minutes -- on my tiny dataset with
> only 8 records.  And I don't know what it does.

If bioformat:dbid is always a child of bioformat:record,
//bioformat:record/bioformat:dbid[@type='primary'] should be faster: the
second // makes the processor look at every descendant of each record,
whereas / only looks at its direct children (fewer candidates to explore).
The same applies if bioformat:record is always a child of your root
element.
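
The difference between the two steps can be sketched with the standard
xml.dom.minidom module: one helper searches the whole subtree of a record,
the other only its direct children. This is an illustration, not how an
XPath engine is implemented; the namespace URI, the helper names and the
sample data are assumptions.

```python
from xml.dom import minidom, Node

# Hypothetical namespace URI; the real bioformat URI is not given here.
NS = "http://example.org/bioformat"


def primary_ids_descendant(record):
    """Like record//bioformat:dbid[@type='primary']: walks the subtree."""
    return [e for e in record.getElementsByTagNameNS(NS, "dbid")
            if e.getAttribute("type") == "primary"]


def primary_ids_child(record):
    """Like record/bioformat:dbid[@type='primary']: direct children only."""
    return [c for c in record.childNodes
            if c.nodeType == Node.ELEMENT_NODE
            and c.namespaceURI == NS
            and c.localName == "dbid"
            and c.getAttribute("type") == "primary"]


# Made-up sample data, for illustration only.
DOC = ('<bf:set xmlns:bf="http://example.org/bioformat">'
       '<bf:record><bf:dbid type="primary">P12345</bf:dbid>'
       '<bf:dbid type="accession">P8392</bf:dbid></bf:record>'
       '</bf:set>')

doc = minidom.parseString(DOC)
records = doc.getElementsByTagNameNS(NS, "record")

by_descendant = [e for r in records for e in primary_ids_descendant(r)]
by_child = [e for r in records for e in primary_ids_child(r)]
ids = [e.firstChild.data for e in by_child]
```

When dbid is always a direct child of record, both helpers select the same
elements, but the second one never descends below the record's children.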

> So my other problem I'm running into is that I'm aiming this to
> be used by biologists and chemists.  If I'm having problems figuring
> out the XPath language, then I suspect they will in general have a
> harder go at it.

XPath is rather easy to understand after a quick look at the
documentation. To use it on an XML document, you also need to know the
document's structure.

-- 
Sylvain Thenault

  LOGILAB           http://www.logilab.org