[XML-SIG] dom building, sax, and namespaces

Andrew Dalke dalke@acm.org
Wed, 23 Jan 2002 13:00:45 -0700


Sylvain Thenault:
> you shouldn't have to use an adapter. The example in the previous mail you
> posted works when you use XMLGenerator instead of pulldom.

Please correct me if I'm wrong.  Doesn't XMLGenerator convert the SAX
events to a text stream?  So if I want to create a DOM I would need to
reparse that stream?  If so, I would rather not have the intermediate
data structure.

> > (BTW, is there no built-in function to get the concatenation of all
> > the text nodes, like my get_text function, below?)
> 
> I don't know if there is one in 4Suite or PyXML, but I think this is a job
> for XSLT.

I thought it was one for XPath, expecting I could use .text() to get
that data.  However, there appears to be a problem with that.

>>> xml.xpath.Compile('spam[@eggs="yes"].text()')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/xpath/__init__.py", line 86, in Compile
    raise RuntimeException(RuntimeException.INTERNAL, stream.getvalue())
xml.xpath.RuntimeException: There is an internal bug in 4XPath.  Please report
this error code to support@4suite.org: Traceback (most recent call last):
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/xpath/__init__.py", line 79, in Compile
    return parser.new().parse(expr)
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/xpath/pyxpath.py", line 322, in parse
    raise SyntaxError(e.pos, e.msg, str)
SyntaxError: <unprintable instance object>

I've already sent them email.
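(As far as I can tell from the spec, text() is an XPath node test, not a
method, so the expression would be spelled spam[@eggs="yes"]/text() or
string(spam[@eggs="yes"]) -- the trailing .text() is probably what trips
the parser, though it should fail more gracefully.  Failing that, a
plain-DOM get_text is only a few lines; here's a minidom sketch of the
idea, not necessarily identical to the function from my earlier mail:)

```python
from xml.dom import minidom

def get_text(node):
    # Concatenate the data of all text-node descendants, in document order.
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(get_text(child))
    return "".join(parts)

doc = minidom.parseString('<spam eggs="yes">green <em>and</em> tasty</spam>')
print(get_text(doc.documentElement))
```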

BTW, one problem I have with all of this is performance.  All I
want is to get the text inside a region which matches a given XPath
query.  Matching something like '//bioformat:dbid[@type="primary"]' is
25 times faster in SAX than DOM, except of course that the SAX code
I wrote is only limited to single node evaluations.   (Don't get me
wrong - given all that's going on with DOM, 25x ain't shabby!)


As I said, I have a way to do XML markup of existing flat-files.  What
I'm writing is a way to index a flat-file.  I want to use XPath
queries to define which fields should be indexed, as in

  mindy_index --id id="//bioformat:dbid[@type='primary']" \
              --alias accession="//bioformat:dbid[@type='accession']" \
              --alias author="//bioformat:author" \
              --record-tag bioformat:record \
               ....
              list of filenames

then be able to do a search

  mindy_search --accession=P8392
or
  mindy_search --author="Andrew Dalke"

and retrieve the original records.  In some sense this would be like a
specialized way to do fast queries of the form

  (text of the) nodes named 'bioformat:record' which have a
     bioformat:dbid[@type='accession'] descendant equal to "P8392"
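(The index itself could be as simple as a mapping from field value to
record location.  A toy in-memory sketch -- the names and the
(filename, offset, length) record address are made up for illustration;
the real index would of course live on disk:)

```python
# Toy sketch: alias name -> field value -> locations of matching records.
# (filename, offset, length) is a hypothetical record address.
index = {}

def add_entry(alias, value, filename, offset, length):
    # Record that this alias/value pair occurs in the given file region.
    index.setdefault(alias, {}).setdefault(value, []).append(
        (filename, offset, length))

def lookup(alias, value):
    # Return all record locations indexed under alias=value.
    return index.get(alias, {}).get(value, [])

add_entry("accession", "P8392", "sprot.dat", 0, 2048)
add_entry("author", "Andrew Dalke", "sprot.dat", 0, 2048)
print(lookup("accession", "P8392"))
```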

Processing a file takes about 20 minutes.  Eight hours (25x more)
would be rather too long.  So I might have some code to figure out a
subset of XPath that I can handle in a specialized SAX handler.
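(The single-node case is straightforward in SAX.  Here's a rough sketch
of the idea -- it handles only patterns of the shape //tag[@attr='value'],
ignores namespace prefixes, and doesn't cope with nested matching
elements:)

```python
import xml.sax

class FieldExtractor(xml.sax.ContentHandler):
    # Collect the text of every element matching a single tag/attribute
    # test -- a tiny subset of XPath like //dbid[@type='primary'].
    def __init__(self, tag, attr=None, value=None):
        xml.sax.ContentHandler.__init__(self)
        self.tag, self.attr, self.value = tag, attr, value
        self.matches = []
        self._buf = None          # accumulates text while inside a match

    def startElement(self, name, attrs):
        if name == self.tag and (self.attr is None or
                                 attrs.get(self.attr) == self.value):
            self._buf = []

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == self.tag and self._buf is not None:
            self.matches.append("".join(self._buf))
            self._buf = None

handler = FieldExtractor("dbid", "type", "primary")
xml.sax.parseString(b'<record><dbid type="primary">P8392</dbid>'
                    b'<dbid type="accession">Q1</dbid></record>', handler)
print(handler.matches)
```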

BTW, how do I get that bioformat:record node list as an XPath
expression?  I've stared at the spec and Kay's "XSLT Programmer's
Reference" and I can't figure it out.  I did find out that this

   //bioformat:record//bioformat:dbid[@type='primary']

takes a really long time to run -- minutes -- on my tiny dataset with
only 8 records.  And I don't know what it does.

The other problem I'm running into is that I'm aiming this at
biologists and chemists.  If I'm having trouble figuring out the
XPath language, then I suspect they will in general have an even
harder go at it.

					Andrew
					dalke@dalkescientific.com