[XML-SIG] dom building, sax, and namespaces

Andrew Dalke dalke@acm.org
Fri, 25 Jan 2002 05:48:51 -0700


Me:
> > Please correct me if I'm wrong.  Doesn't XMLGenerator convert the SAX
> > events to a text stream?

Sylvain Thenault wrote:
> no, XMLGenerator produce a DOM tree using 4DOM implementation.

Then what's xml.sax.saxutils.XMLGenerator?
# --- XMLGenerator is the SAX2 ContentHandler for writing back XML

> but I don't know if it's in PYXML
> 0.7 or if it's still only in the CVS.
> BTW, which version of PyXML and/or 4Suite are you using ?

I was using PyXML 0.7.  I just pulled the latest out of CVS.

josiah> grep -l XMLGenerator `find . -name '*.py'`
./test/test_sax.py
./test/test_sax_xmlproc.py
./test/test_saxdrivers.py
./test/test_sax2.py
./xml/sax/expatreader.py
./xml/sax/saxutils.py
josiah> grep XMLGenerator xml/sax/saxutils.py
# --- XMLGenerator is the SAX2 ContentHandler for writing back XML
class XMLGenerator(handler.ContentHandler):
class LexicalXMLGenerator(XMLGenerator, saxlib.LexicalHandler):
    """A XMLGenerator that also supports the LexicalHandler interface"""
        XMLGenerator.__init__(self, out, encoding)
class ContentGenerator(XMLGenerator):
        return XMLGenerator.characters(self, str[start:start+end])

That's the one I saw before.  I looked at the source code for that
class.  No mention of DOM at all.
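
To illustrate the point, here is a minimal sketch of what that class does, written against the xml.sax.saxutils.XMLGenerator in today's Python standard library (an assumption, since the PyXML 0.7 version may differ in detail).  The element names and the helper sax_events_to_text are made up for the example.  SAX events go in and serialized text comes out; no DOM is involved.

```python
import io
from xml.sax.saxutils import XMLGenerator


def sax_events_to_text():
    """Feed a few hand-written SAX events to XMLGenerator and return the text."""
    out = io.StringIO()
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()                              # writes the XML declaration
    gen.startElement("record", {})                   # hypothetical element names
    gen.startElement("dbid", {"type": "primary"})
    gen.characters("P12345")
    gen.endElement("dbid")
    gen.endElement("record")
    gen.endDocument()
    return out.getvalue()


if __name__ == "__main__":
    print(sax_events_to_text())
```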

I went to download 4Suite from CVS.  I followed the CVS login commands
at ftp://ftp.fourthought.com/pub/cvs-snapshots/ but it says

josiah> cvs -z3 -d:pserver:anonymous@cvs.4suite.org:/var/local/cvsroot co \
 -R STABLE FT
cvs server: cannot find module `STABLE' - ignored
cvs server: cannot find module `FT' - ignored
cvs [checkout aborted]: cannot expand modules
josiah>

so I downloaded 2002-01-25-4Suite.tar.gz .
josiah> zcat 2002-01-25-4Suite.tar.gz | tar xf -
josiah> cd 4Suite/
josiah> grep -l XMLGenerator `find . -name '*.py'`
josiah>

I don't see the XMLGenerator you're talking about.

> I recommend you to read the W3C XPATH recommandation:
> http://www.w3.org/TR/xpath

I have tried.  I find it hard going.  Something hasn't yet clicked
on how the data model works.  That's also why I'm having problems
working with the DOM.  I know it will come with practice, but it's
annoying in the meanwhile.

> >   Matching something like '//bioformat:dbid[@type="primary"]' is
> > 25 times faster in SAX than DOM, except of course that the SAX code
> > I wrote is only limited to single node evaluations.

> how do you apply an xpath on sax events ?

In my case, I special-case the XPath: if it fits a specific restricted
syntax, I implement the match with a special-purpose SAX handler.
Otherwise I build a DOM from the events, run the XPath query on that,
and then extract the matched text.
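
A rough sketch of such a special-purpose handler, for the one pattern '//name[@attr="value"]' only.  This is not the actual code; the class name, the sample document, and the namespace URI are invented for illustration, and it uses the standard-library xml.sax in its default (non-namespace) mode, so it matches on the raw qualified name.

```python
import xml.sax


class AttrMatchHandler(xml.sax.ContentHandler):
    """Collect the text of every element named `name` whose `attr` equals `value`.

    Hypothetical stand-in for the restricted XPath //name[@attr="value"].
    """

    def __init__(self, name, attr, value):
        super().__init__()
        self.name, self.attr, self.value = name, attr, value
        self.matches = []     # text of each completed match
        self._open = []       # one flag per open element: is it a match?
        self._buffers = []    # one text buffer per open matching element

    def startElement(self, name, attrs):
        is_match = name == self.name and attrs.get(self.attr) == self.value
        self._open.append(is_match)
        if is_match:
            self._buffers.append([])

    def characters(self, content):
        # Text belongs to every matching element that is currently open.
        for buf in self._buffers:
            buf.append(content)

    def endElement(self, name):
        if self._open.pop():
            self.matches.append("".join(self._buffers.pop()))


# Made-up sample data; the namespace URI is a placeholder.
DOC = b"""<bioformat:dataset xmlns:bioformat="http://example.org/bioformat">
  <bioformat:record>
    <bioformat:dbid type="primary">P12345</bioformat:dbid>
    <bioformat:dbid type="secondary">Q99999</bioformat:dbid>
  </bioformat:record>
  <bioformat:record>
    <bioformat:dbid type="primary">P67890</bioformat:dbid>
  </bioformat:record>
</bioformat:dataset>"""


def find_primary_ids(data):
    handler = AttrMatchHandler("bioformat:dbid", "type", "primary")
    xml.sax.parseString(data, handler)
    return handler.matches


if __name__ == "__main__":
    print(find_primary_ids(DOC))
```

The handler keeps only a stack of flags and the text of open matches, which is why it can stay far cheaper than building a full tree first.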

> isn't 20 minutes to process a file too long ?

My standard test case is 227 MB.  'grep ^ID | wc', which returns a
count of the number of records in the file, takes a minute.  (My main
development machine is my 233 MHz laptop.)

> Maybe should you think to use a database which could be queried using
> XPATH ? (a database seem to be more adapted to your amount of data)

I have a spectrum of solutions based on the technology I'm developing.
The one I'm focusing on now is a BerkeleyDBM solution for simple
id -> flat file record lookups.  This fits in very closely with existing
solutions,
except for the use of XPath as the query language.
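
A minimal sketch of the id -> flat file record idea, with several assumptions: it uses the standard-library dbm.dumb module in place of Berkeley DB, it assumes each record starts with an "ID" line and ends with a "//" line, and the function names build_index and lookup are hypothetical.  The index stores only a byte offset and length, so the flat file itself stays untouched.

```python
import dbm.dumb


def build_index(flat_path, index_path):
    """Store 'offset:length' for each record id found in the flat file."""
    with open(flat_path, "rb") as f, dbm.dumb.open(index_path, "n") as db:
        start = None
        rec_id = None
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            if line.startswith(b"ID "):
                # Assumed layout: "ID   <identifier> ..." opens a record.
                start = pos
                rec_id = line.split()[1]
            elif line.startswith(b"//") and rec_id is not None:
                # Assumed terminator: "//" closes the record.
                db[rec_id] = b"%d:%d" % (start, f.tell() - start)
                rec_id = None


def lookup(flat_path, index_path, rec_id):
    """Return the raw text of one record, read directly from the flat file."""
    with dbm.dumb.open(index_path, "r") as db:
        offset, length = map(int, db[rec_id.encode()].split(b":"))
    with open(flat_path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode()
```

A caller would run build_index once per flat file and then call lookup(flat_path, index_path, "some_id") for each query; the returned record text can then be parsed into SAX events as before.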

The next step is to use an XML database.  This is a harder sell for now
because no one I know in this field uses an XML database -- most are
using relational databases.  And once I mention "database server" they
start fretting about getting a database manager, or they say their
Oracle person has no experience with XML databases.

By having an intermediate solution for simple searches, it's an easier
path to having a more complex database, since the API for existing
tasks stays the same.


> if bioformat:dbid is always a child of bioformat:record,
> //bioformat:record/bioformat:dbid[@type='primary'] should be faster (less
> solutions to explore)
> Same thing may be applied if bioformat:record is always a child of your
> root element.

It isn't.  Here's the data flow

                                            [existing flat file]
                                                    |
                                                   \/
  [format definition as]--> parser generator --> [parser]
  [a regular expression]                            |
                                                   \/
                                             [SAX events in Python]
                                               /   |    \
                                         Special  DOM   Database
                                         purpose
                                         handlers

The structure of the SAX events is the same as the original file,
since I'm only adding markup.  The format definition which produces
the markup may include other intermediate elements, and I don't know
what those might be beforehand.  I can define some structure to it
all, but mostly of the sort that

  "feature_location" must be a descendant of "feature"

I can make no assertions as to how deep that descent goes.  Hence
my liberal use of '//'s.

> XPATH is rather easy to understand with a litle look at the
> documentation. In order to use it on an xml document, you also have to
> know the document structure.

I insist that I have read it several times, read Kay's book, and read
the 'Python&XML' book on the topic.  I don't find it easy.  For example,
I definitely found SQL easier, in that I could understand it by looking
at a few examples, rather than needing to read the documentation first.
Now, I know the SQL data model is less complicated than XML's, but SQL
is what all my potential customers are used to using, so at the very
least I have to convince them that the complications are worth it.


					Andrew
					dalke@dalkescientific.com