[XML-SIG] SAX 2.0, again

Lars Marius Garshol larsga@garshol.priv.no
21 Feb 2000 09:23:36 +0100


Some weeks ago David Megginson released a SAX 2.0 beta in Java, and
this release appears to be quite close to the final form of SAX 2.0.
I've started working on translating this release into Python, but
there are some general design issues that need to be thought through
before this can be completed.


### XML names

The first problem is how to represent XML names. SAX 2.0 can
handle namespaces, and so we must somehow represent namespace-names.
I can see several different ways of doing this, all with their
advantages and disadvantages, and would very much like to hear the
opinion of the XML-SIG on this.

The alternatives I've thought of are:

 - use (uri, localpart) tuple for namespace-names, simple strings for
   ordinary names

 - use (uri, localpart, rawname) for namespace-names, simple strings
   for ordinary names; rawname must be communicated out of band
   somehow

 - use XMLName objects for names, regardless of kind. If these were
   immutable and drivers cached them in hash tables, this might not
   be too inefficient.

 - use separate parameters for uri, localpart and rawname, letting
   some of these be None depending on what was in the document and
   what the parser supports.
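
To make the comparison concrete, here is a minimal sketch of the four
alternatives in present-day Python. Every identifier in it (XMLName,
start_element, intern_name, the example URI and prefix) is an
assumption made for illustration, not part of any SAX release.

```python
from typing import NamedTuple, Optional

# Hypothetical example namespace; any URI would do.
URI = "http://www.w3.org/1999/xhtml"

# Alternative 1: (uri, localpart) tuple; ordinary names stay plain strings.
name_1 = (URI, "title")
plain_1 = "title"

# Alternative 2: (uri, localpart, rawname) tuple.
name_2 = (URI, "title", "xh:title")


# Alternative 3: one immutable, hashable XMLName object for every name.
class XMLName(NamedTuple):
    uri: Optional[str]
    localpart: str
    rawname: Optional[str] = None


name_3 = XMLName(URI, "title", "xh:title")
plain_3 = XMLName(None, "title")


# Alternative 4: separate parameters, any of which may be None.
def start_element(uri, localpart, rawname, attrs):
    return (uri, localpart, rawname)


def intern_name(cache, uri, localpart, rawname=None):
    """Return one shared XMLName per distinct name, creating it on first use."""
    key = (uri, localpart, rawname)
    if key not in cache:
        cache[key] = XMLName(uri, localpart, rawname)
    return cache[key]
```

With interning, a driver that sees the same element name thousands of
times allocates the object only once, which is the efficiency argument
behind the third alternative.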


### Driver maintenance

Given that SAX 2.0 is larger than SAX 1.0 and also supports various
possibilities for extensions, writing a good and complete SAX 2.0
driver can be quite a bit of work. If any parser writers or others
feel like contributing to this work by writing and maintaining
drivers, then please feel encouraged to do so.

If nobody does write drivers, I will do it, but it will probably take
longer and they may not be as complete.


### Unicode support

Python 1.6 will have Unicode support, and so we should make PySAX 2.0
Unicode-ready. The main part of this is really adding the InputSource
object to the library, since this allows applications to feed byte or
character streams to the parser in a convenient way.

The question is: how will the distinction between byte streams and
character streams look in Python 1.6? Will there be one? How should
we relate to it?

Could we do it simply by using file-like objects with different
semantics?
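
One possible shape for this, as a sketch only: it is written in
present-day Python, where byte strings and character strings are
separate types, which is an assumption and not a description of
Python 1.6. The InputSource class, its method names and read_text are
all hypothetical.

```python
import io


class InputSource:
    """Hypothetical holder for either a byte stream or a character stream."""

    def __init__(self, system_id=None):
        self.system_id = system_id
        self.byte_stream = None
        self.char_stream = None
        self.encoding = None

    def set_byte_stream(self, stream, encoding=None):
        # read() on this stream returns bytes that still need decoding.
        self.byte_stream = stream
        self.encoding = encoding

    def set_character_stream(self, stream):
        # read() on this stream returns already-decoded text.
        self.char_stream = stream


def read_text(source, default_encoding="utf-8"):
    """Return the document as text, preferring the character stream."""
    if source.char_stream is not None:
        return source.char_stream.read()
    data = source.byte_stream.read()
    return data.decode(source.encoding or default_encoding)


TEXT = "<doc>bl\u00e5b\u00e6r</doc>"

byte_source = InputSource()
byte_source.set_byte_stream(io.BytesIO(TEXT.encode("iso-8859-1")), "iso-8859-1")

char_source = InputSource()
char_source.set_character_stream(io.StringIO(TEXT))
```

Both sources are ordinary file-like objects; the only difference is
what read() returns, so the parser needs to know the encoding for the
byte stream only.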


### easySAX vs Pyxie

What should we do with these two? Should we try to turn Pyxie into
what we envisioned easySAX to be, or should we maintain two such
libraries? I see advantages and disadvantages to both approaches.

One idea I've had for easySAX, inspired by John Aycock's Spark parser
generator, is that one could write SAX document handlers with three
kinds of special methods: start-element, end-element and element
content methods. These could use the 's_', 'e_' and 'c_' prefixes,
respectively.

Unlike in xmllib, though, the names of these methods would have no
significance beyond the prefix. Instead, the documentation string
could contain very simple XPath expressions to be used to dispatch
events onto the various methods.

This should allow us to write easySAX applications that look somewhat
like this (self.out is some XML generator class which may or may not
be part of easySAX):

class MyHandler(GenericEZSAXHandler):

  def s_doc(self, attrs):
    ' document '
    self.out.write_template("top")

  def c_sec_title(self, contents, attrs):
    ' section / title '
    self.out.make_element('h1', contents)

  def c_subsec_title(self, contents, attrs):
    ' subsection / title '
    self.out.make_element('h2', contents)

  def e_doc(self):
    ' document '
    self.out.write_template("bottom")


I'm fairly confident that a layer on top of SAX 2.0 to enable such
easySAX applications could be made quite fast, and it should be easy
to implement as well. (I've made an early sketch of this.)
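
For illustration, one way such a layer might look. This is not the
early sketch mentioned above but a minimal reconstruction under stated
assumptions: it uses the xml.sax module of the present-day Python
standard library, ignores namespaces, and treats a documentation string
such as ' section / title ' as a path that must match the innermost
open elements. The internals of GenericEZSAXHandler and the
TitleCollector class are hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class GenericEZSAXHandler(ContentHandler):
    """Hypothetical dispatch layer: routes SAX events by docstring path."""

    def __init__(self):
        super().__init__()
        self._rules = {"s_": [], "e_": [], "c_": []}
        for attr in dir(self):
            if attr[:2] not in self._rules:
                continue
            method = getattr(self, attr)
            if callable(method) and method.__doc__:
                # ' section / title ' -> ('section', 'title')
                path = tuple(s.strip() for s in method.__doc__.split("/"))
                self._rules[attr[:2]].append((path, method))
        self._stack = []   # names of the currently open elements
        self._frames = []  # per open element: (attrs, list of text chunks)

    def _matching(self, prefix):
        # A path matches when it equals the tail of the open-element stack.
        for path, method in self._rules[prefix]:
            if tuple(self._stack[-len(path):]) == path:
                yield method

    def startElement(self, name, attrs):
        self._stack.append(name)
        self._frames.append((attrs, []))
        for method in self._matching("s_"):
            method(attrs)

    def characters(self, content):
        # Text is credited to every open element, nested text included.
        for _, chunks in self._frames:
            chunks.append(content)

    def endElement(self, name):
        attrs, chunks = self._frames[-1]
        for method in self._matching("c_"):
            method("".join(chunks), attrs)
        for method in self._matching("e_"):
            method()
        self._stack.pop()
        self._frames.pop()


class TitleCollector(GenericEZSAXHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def s_doc(self, attrs):
        ' document '
        self.events.append("start")

    def c_sec_title(self, contents, attrs):
        ' section / title '
        self.events.append(("h1", contents))

    def e_doc(self):
        ' document '
        self.events.append("end")


handler = TitleCollector()
xml.sax.parseString(
    b"<document><section><title>Intro</title></section></document>",
    handler)
```

The rule table is built once in the constructor, so the per-event cost
is a tuple comparison per registered method; a real implementation
could index the rules by the last path step to avoid the linear scan.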

The only question is what to do with namespace-names. Perhaps the
application could declare constant namespace prefixes to be used in
the documentation strings in its constructor.
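
A minimal sketch of that idea, assuming a prefix-to-URI dictionary set
up in the constructor and a hypothetical helper, resolve_path, that
turns a documentation string into (uri, localpart) pairs. The 'xh'
prefix and the class name are examples only.

```python
def resolve_path(docstring, prefixes):
    """Turn ' xh:body / title ' into a tuple of (uri, localpart) pairs."""
    steps = []
    for step in docstring.split("/"):
        prefix, sep, local = step.strip().partition(":")
        if sep:
            # Raises KeyError if the prefix was never declared.
            steps.append((prefixes[prefix], local))
        else:
            # No prefix: the name is in no namespace.
            steps.append((None, prefix))
    return tuple(steps)


class NamespaceAwareHandler:
    """Hypothetical handler that declares its prefixes up front."""

    def __init__(self):
        # Constant prefixes, valid only inside documentation strings.
        self.prefixes = {"xh": "http://www.w3.org/1999/xhtml"}

    def c_heading(self, contents, attrs):
        ' xh:body / xh:h1 '


demo = NamespaceAwareHandler()
resolved = resolve_path(demo.c_heading.__doc__, demo.prefixes)
```

The prefixes used in the documentation strings are then independent of
whatever prefixes the document itself happens to use, since matching
is done on the URI.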


--Lars M.