[XML-SIG] New Reader Architecture

Uche Ogbuji uche.ogbuji@fourthought.com
Sun, 05 Nov 2000 10:46:13 -0700


We have rewritten most of the code used for creating DOMs from text.
I've cc'ed xml-sig because the check-ins of 4DOM I'll be making today
reflect these changes.

All the changes described here are intended for the 4Suite 0.9.2 release
but are reflected in the CVS snapshot I just put up today.

I won't go over much of the legacy except as it occurs in 4DOM.  In the
past, to read, you would do as follows:

from xml.dom.ext.reader import Sax2
xml_dom_object = Sax2.FromXml(text)

or

from xml.dom.ext.reader import HtmlLib
html_dom_object = HtmlLib.FromHtml(text)

This worked well for a while, and this interface will still be supported
although it is now deprecated.  The problem is that as other parts of
4Suite evolved various DOM subsets to deal with footprint and
performance problems, it was becoming nearly impossible to properly
parameterize the particular type of DOM in use for, say, XSLT processing.

Finally, when we added XPointer we had to find a way to allow it to
return a type of DOM according to user configuration, so that, say, if
you pointed 4XPointer at

http://xslt.fourthought.com/docbook_html1.xslt#xpointer(/*/*[3])

some users could choose to get back 4DOM representing the document
sub-set while the code that implements the document() function in XSLT
would be able to match the returned type of DOM to the one in the
original source document (say cDomlette or pDomlette).

The solution I came up with for parameterizing DOM types like this was
to create at least one reader class for each DOM type.  This Reader
class is responsible for parsing and releasing nodes, and could hold any
global state for a group of DOM subtrees (say the collection of source
documents used in XSLT).  cDomlette, which is only designed to work at
a low level with Expat, has only the RawExpatReader class, but pDomlette
has PyExpatReader and SaxReader classes.  4DOM has the usual assortment
of classes: Sax, Sax2, HtmlLib, but I have added a PyExpat reader there
as well.
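
As a rough illustration of the pattern (not the actual 4Suite code), the
sketch below uses the standard library's minidom as a stand-in DOM
type.  The MinidomReader class, its documents list and the root_name
function are hypothetical names made up for this example; only the
fromStream and releaseNode method names come from the new reader
interface.

```python
from io import BytesIO
from xml.dom import minidom


class MinidomReader:
    """Hypothetical reader for one DOM type (here, the stdlib minidom)."""

    def __init__(self):
        # State shared by every document this reader builds, e.g. the
        # collection of source documents used in one XSLT run.
        self.documents = []

    def fromStream(self, stream):
        # Parse the stream and remember the resulting document.
        doc = minidom.parse(stream)
        self.documents.append(doc)
        return doc

    def releaseNode(self, node):
        # Forget the document and break its parent/child reference cycles.
        if node in self.documents:
            self.documents.remove(node)
        node.unlink()


def root_name(reader, stream):
    """Consumer code: it knows the reader interface, not the DOM type."""
    doc = reader.fromStream(stream)
    try:
        return doc.documentElement.tagName
    finally:
        reader.releaseNode(doc)


reader = MinidomReader()
name = root_name(reader, BytesIO(b"<doc><item/></doc>"))
```

Because root_name only ever talks to the reader, swapping in a reader
for a different DOM type would not require touching it.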

Using one of the new reader classes is also simple.  You create an
instance passing in to the constructor any parameters relevant to the
state of that class.  For instance, Sax reader classes support
validation and so you can pass a validate flag to the initializer of
each instance of such classes:

from xml.dom.ext.reader import Sax2
reader = Sax2.Reader(validate=1)

or

from Ft.Lib.pDomlette import SaxReader
reader = SaxReader(validate=1)

However, Expat supports no validation so there is no such initializer
argument in the specialized expat readers:

from xml.dom.ext.reader import PyExpat
reader = PyExpat.Reader()

or

from Ft.Lib.pDomlette import PyExpatReader
reader = PyExpatReader()

The initializer parameters are used for all reading (unless changed
through direct attribute manipulation), so if you are using Sax you
might want to have multiple reader instances in your code: one for
validating parses and one for non-validating parses.

[Note that you can now directly specify the desired parser in any SAX
reader.]

Once you have the reader instance, you use the fromStream or fromUri
method to create each DOM.  The equivalents to the other common utility
reader functions (say fromString or fromFile) have been eliminated for
simplicity since it is trivial to turn text or a filename into a
stream.  fromUri was provided because 4Suite now supports URI handlers
and the conversion from URI to stream might not be as straightforward as
using Python's urllib.
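
For those who relied on the old string and file helpers, here is a
minimal sketch of the conversion to a stream.  The two helper names
are mine and are not part of 4Suite; the import falls back to the io
module so the same code also runs on later Python versions.

```python
try:
    from cStringIO import StringIO  # Python 1.5 / 2.x
except ImportError:
    from io import StringIO  # later Python versions


def stream_from_text(text):
    """Wrap an in-memory string in a file-like object."""
    return StringIO(text)


def stream_from_file(filename):
    """Open a file for reading; the caller should close the stream."""
    return open(filename, "rb")


stream = stream_from_text("<doc>hello</doc>")
# The stream can then be handed to any reader instance:
# xml_doc = reader.fromStream(stream)
```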

fromStream accepts a stream object as its first parameter and an
optional ownerdocument as its second.  If the ownerdocument is given,
the return from the method will be a DocumentFragment instance,
otherwise it will be a Document instance.

xml_doc = reader.fromStream(stream)

or

xml_docfrag = reader.fromStream(stream, ownerDoc)
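
To make the return-type rule concrete, here is a sketch of the same
behaviour built on the standard library's minidom.  It is an assumption
about how such a method could be written, not the 4DOM or Domlette
implementation, and the from_stream function name is hypothetical.

```python
from io import BytesIO
from xml.dom import minidom, Node


def from_stream(stream, owner_doc=None):
    """Return a Document, or a DocumentFragment owned by owner_doc."""
    parsed = minidom.parse(stream)
    if owner_doc is None:
        return parsed
    fragment = owner_doc.createDocumentFragment()
    for child in parsed.childNodes:
        if child.nodeType == Node.DOCUMENT_TYPE_NODE:
            continue  # doctype nodes cannot be imported into another document
        # Copy each top-level node into the owner document's tree.
        fragment.appendChild(owner_doc.importNode(child, True))
    parsed.unlink()  # the temporary document is no longer needed
    return fragment


xml_doc = from_stream(BytesIO(b"<a><b/></a>"))
xml_docfrag = from_stream(BytesIO(b"<c/>"), xml_doc)
```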

[Note that the Domlette readers also have an argument to fromStream,
stripElements, for specifying elements from which white-space is to be
stripped while building the DOM.  This is merely to support some
internal XSLT optimizations until a better way can be found.  Using
this argument is deprecated and it may be removed from the method
signatures in any future 4Suite release.]

Python 1.x users can break circular references by calling the
releaseNode method on the reader that was used to create the DOM:

reader.releaseNode(xml_doc)

Comments and bug reports welcome.  We shall update the API docs to
reflect these changes before releasing 4Suite 0.9.2.


-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python