[XML-SIG] A usage scenario for Python and XML...

Greg Wolff pwolff@cox.rr.com
Sat, 03 Jun 2000 18:04:37 -0400


Hello,

I've been lurking, I mean following, the discussion of Python and XML
implementations on the Python XML Sig list for a while now.  I had one
particular email exchange with Fred Drake about the SAX versus SAX2
issue.  It might be useful if I elaborate to the whole group on the
example I gave of why I want SAX2.  The Expat C interface includes
features that I use that are not in SAX, but are in SAX2, if I read the
documentation correctly.  I hope I don't have to hack the driver
interface to get them, but if my current understanding of SAX2 is
correct, I won't need to...

I intend to use Python extensions in ZOPE to build an e-pub system.
I am the architect of several large SGML/XML based web publishing
systems.  These systems were not built with ZOPE (they ran first on
C++/NSAPI and currently on Vignette StoryServer with C++ DLOAD
modules), but I think some of the relevant experience may be of use to
this discussion.

First, a brief discussion of requirements.  My implementations are
for large scale online publishing with massive document sets in
SGML/XML.  Full DTD document models apply.  Document
fragments must be pulled as needed and formatted for display as HTML. 
Multiple styles of presentation must be applicable to any particular
fragment as needed for web GUI presentation.  Documents are very large,
tens of megabytes in some cases, with up to 15 levels of hierarchy
supported.  Sophisticated SGML/XML structure aware search is used to do
full text search.  Individual documents must be useable in multiple end
user published products at the same time without embedding any product
specific info in the documents.

Implementation:  

An inverted index of the "relevant" portion of the XML object hierarchy
is built but the document is not broken up into its component objects. 
Document files are stored as an XML character stream.  When a user
desires to view a particular web page, that page is constructed on the
fly and presented.  Caching of the final documents is used.

Page Generation:  

The XML document fragment is pulled: the inverted index tells us the
byte offsets for the start and end tags of the particular fragment of
interest.  The fragment is run through a SAX based XML to HTML
conversion object that takes in the fragment, the style to use and
control information.
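
As a rough illustration of the step above, here is a minimal Python
sketch using the standard xml.sax module.  The names HtmlHandler and
render_fragment and the style table are hypothetical, not part of our
production system, and the offsets are found with a simple search here
rather than read from a real inverted index.

```python
import xml.sax
from xml.sax.saxutils import escape


class HtmlHandler(xml.sax.ContentHandler):
    """Turn SAX events into HTML using a simple style table."""

    def __init__(self, style):
        super().__init__()
        self.style = style   # hypothetical: XML element name -> HTML tag
        self.out = []

    def startElement(self, name, attrs):
        tag = self.style.get(name)
        if tag:              # elements not in the style table are dropped
            self.out.append("<%s>" % tag)

    def endElement(self, name):
        tag = self.style.get(name)
        if tag:
            self.out.append("</%s>" % tag)

    def characters(self, content):
        self.out.append(escape(content))


def render_fragment(data, start, end, style):
    """Slice bytes [start, end) out of the document and convert them."""
    handler = HtmlHandler(style)
    xml.sax.parseString(data[start:end], handler)
    return "".join(handler.out)


doc = (b"<book><chapter><title>One</title><para>a &lt; b</para></chapter>"
       b"<chapter><title>Two</title></chapter></book>")

# Stand-in for the index lookup: offsets of the first chapter.
start = doc.index(b"<chapter>")
end = doc.index(b"</chapter>") + len(b"</chapter>")

style = {"chapter": "div", "title": "h1", "para": "p"}
html = render_fragment(doc, start, end, style)
print(html)
```

With a document stored on disk, the slice would instead be a seek to
the start offset followed by a read of (end - start) bytes, so only the
fragment of interest is ever parsed.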

NOTE: Byte offsets are not available in SAX but are in SAX2.  They are
available in Expat and the Perl Expat modules which we currently use for
this purpose.  The older Java SAX api is unusable for this application.
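
For comparison, the Expat binding in Python (xml.parsers.expat) exposes
the parser position as the CurrentByteIndex attribute.  The sketch
below is only an illustration of how tag offsets could be collected for
an index; index_elements is a hypothetical name, and it assumes that
inside a handler CurrentByteIndex reports the offset at which the
current tag begins.

```python
import xml.parsers.expat


def index_elements(data):
    """Return (name, start_tag_offset, end_tag_offset) for each element."""
    parser = xml.parsers.expat.ParserCreate()
    open_elements = []   # stack of (name, offset of the start tag)
    spans = []

    def start_element(name, attrs):
        # Assumption: this is the byte offset of the "<" of the start tag.
        open_elements.append((name, parser.CurrentByteIndex))

    def end_element(name):
        open_name, start = open_elements.pop()
        # Assumption: this is the byte offset of the "<" of the end tag.
        spans.append((open_name, start, parser.CurrentByteIndex))

    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.Parse(data, True)
    return spans


data = b"<doc><sec>abc</sec><sec>de</sec></doc>"
spans = index_elements(data)
for name, start, end in spans:
    print(name, start, end)
```

The offsets count bytes in the input passed to Parse, so they can be
used directly for seeking in the stored character stream.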

Performance:

The conversion from XML to HTML currently runs at just about a megabyte
per second on roughly 300 MHz class Linux boxes and is faster on
big Sparc machines with fast memory backplanes.  It slows down as less
information is thrown away in the conversion.  Time is directly
proportional to the amount of data read in and put out.  Usually, much
less data comes out than goes in because an individual web page is small
relative to the megabyte size input fragment.

Implications:

SAX (Expat) can stream across a large document at amazing
speed.  The event driven document handler approach allows a complete
style conversion to be performed and produce an output HTML file with
CSS and the works.  

Major Requirement: Complete location information, including byte
offsets, is required at all relevant element start and end tag
instances.

/pgw
Greg Wolff
pwolff@cox.rr.com