[XML-SIG] SAX prettyprinter V2 and SGMLOP

Christian Tismer tismer@appliedbiometrics.com
Mon, 25 Jan 1999 21:50:30 +0100


Walter Underwood wrote:
> 
> At 04:44 PM 1/23/99 +0100, Christian Tismer wrote:
> >What I need to find is the fastest acceptable parser which allows
> >me to turn masses of XML data into Python structures. [...] we are
> >processing XML encoded database records which are quite irregular
> >(useless to use a relational database) and quite simple, but the
> >standard size is some 50MB. This is why I'm after speed, much more than
> >conformance.
> 
> I'm using pyexpat for the XML support in our search engine.
> At this point in development, I'm collecting text and associating
> it with *every* enclosing element. So this is worst-case for
> parsing time.
> 
> Parsing Jon Bosak's tagged "Old Testament" (3.4 megabytes) takes
> 30 seconds. That document is pretty heavily tagged, with an element
> for each verse, each chapter, each book, the body, etc.
> 
> Collecting less information would probably be faster.

Interesting. I tested my Indenter with this file
(what a nice example).
It takes 11.75 seconds to indent this through SAX, using sgmlop.
With xmlproc, it takes 30.87 seconds.
Running the whole text through sgmlop with no event handlers
attached took under one second.
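
To make the "with events vs. without events" comparison reproducible,
here is a minimal sketch. It is not the sgmlop or xmlproc setup measured
above: it assumes the standard-library expat binding (xml.parsers.expat)
as a stand-in, and the function name parse_seconds and the synthetic
document are made up for illustration. It only shows how much of the time
goes into Python-level callbacks rather than into the parser itself.

```python
import time
import xml.parsers.expat


def parse_seconds(data, with_handlers):
    """Parse `data` once; return (elapsed seconds, event counts)."""
    parser = xml.parsers.expat.ParserCreate()
    counts = {"start": 0, "chars": 0}
    if with_handlers:
        # Every callback is a Python-level call; this is the overhead
        # that disappears when no events are attached.
        def start(name, attrs):
            counts["start"] += 1

        def chars(text):
            counts["chars"] += len(text)

        parser.StartElementHandler = start
        parser.CharacterDataHandler = chars
    t0 = time.perf_counter()
    parser.Parse(data, True)  # True marks this as the final chunk
    return time.perf_counter() - t0, counts


# Synthetic, heavily tagged document (hypothetical test data).
data = b"<book>" + b"<verse>In the beginning</verse>" * 20000 + b"</book>"
bare, _ = parse_seconds(data, with_handlers=False)
full, counts = parse_seconds(data, with_handlers=True)
print("no handlers: %.3fs, with handlers: %.3fs" % (bare, full))
```

The absolute numbers depend on the machine and the parser, so only the
ratio between the two runs is of interest.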

> If you need a lot more speed than this (integer factors faster)
> you might need to do all the parsing in C. Remember that there
> is a difference between a parser that implements all of XML and
> a parser that extracts the data you need from your XML documents.
> If you can trust the documents to be legal (perhaps they are
> checked when generated), then a hard-coded parser may be the
> answer.

Well, both are true. I want to validate small amounts of newly
added data "records" which are in XML format and are then
kept in a special repository, and I want to be able to
re-import large amounts of XML which my app exported
earlier. This means I need a validating parser of
acceptable speed, and xmlproc looks very good for that, I think.
I also need something that simply eats large amounts
of approved data.
But I won't go so far as to code all this in C, since these imports
will not be so frequent. I would even prefer to do it all
in Python if possible.
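
For the "simply eats large amounts of approved data" half, a rough sketch
of what I have in mind, under several assumptions: it uses the
standard-library expat binding rather than sgmlop, the element name
"record" and the function iter_records are hypothetical, and records are
taken to be flat (one level of child elements holding text). The file is
fed in chunks, so a 50MB export never has to sit in memory as one string.

```python
import io
import xml.parsers.expat


def iter_records(stream, record_tag="record", chunk_size=64 * 1024):
    """Yield one dict per <record_tag> element: child tag -> text."""
    parser = xml.parsers.expat.ParserCreate()
    done = []  # records completed while parsing the latest chunk
    state = {"record": None, "field": None, "text": []}

    def start(name, attrs):
        if name == record_tag:
            state["record"] = dict(attrs)  # attributes become initial keys
        elif state["record"] is not None:
            state["field"] = name
            state["text"] = []

    def chars(data):
        # Text may arrive in several pieces, so collect and join later.
        if state["field"] is not None:
            state["text"].append(data)

    def end(name):
        if name == record_tag:
            done.append(state["record"])
            state["record"] = None
        elif name == state["field"]:
            state["record"][name] = "".join(state["text"])
            state["field"] = None

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end

    while True:
        chunk = stream.read(chunk_size)  # stream must be opened in binary mode
        parser.Parse(chunk, not chunk)   # an empty chunk finishes the parse
        yield from done
        del done[:]
        if not chunk:
            break


# Usage with a tiny in-memory export (hypothetical data):
sample = b'<export><record id="1"><name>Ann</name></record></export>'
for rec in iter_records(io.BytesIO(sample)):
    print(rec)
```

This does no validation at all; it relies on the data having been checked
when it was first added.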

There are also cases where even sgmlop does much more
than I need. In some applications I just want to
know where the tags start and end: no substitutions,
no parsing or reordering of attributes,
just the ability to juggle unmodified pieces of XML.

Therefore I proposed an XML scanner which just provides
the tools to build up what you actually need. Maybe I
overlooked it and we have that already somewhere.
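
To make the scanner idea concrete, here is a minimal sketch, not an
existing module. The names scan and _MARKUP are made up, and the approach
(one regular expression over the whole text) is just an assumption about
how such a tool could look. It reports only offsets, so every piece can
be sliced out of the original string unmodified. Known limits: a ">"
inside an attribute value, or a DOCTYPE with an internal subset, will
confuse it.

```python
import re

# Alternatives are tried in order, so comments, CDATA sections and
# processing instructions are recognised before the generic tag pattern.
_MARKUP = re.compile(
    r"<!--.*?-->"            # comment
    r"|<!\[CDATA\[.*?\]\]>"  # CDATA section
    r"|<\?.*?\?>"            # processing instruction
    r"|<[^<>]*>",            # any other tag (start, end, empty, declaration)
    re.DOTALL,
)


def scan(text):
    """Yield (kind, start, end); kind is 'markup' or 'text'.

    Nothing is decoded or substituted: text[start:end] is always an
    unmodified slice of the input.
    """
    pos = 0
    for m in _MARKUP.finditer(text):
        if m.start() > pos:
            yield ("text", pos, m.start())
        yield ("markup", m.start(), m.end())
        pos = m.end()
    if pos < len(text):
        yield ("text", pos, len(text))


doc = '<rec id="1">a &amp; b<!-- c > d --><x/></rec>'
for kind, start, end in scan(doc):
    print(kind, repr(doc[start:end]))
```

Since the pieces cover the input without gaps, joining all slices gives
back the original document byte for byte, which is what I want for
juggling with raw XML fragments.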

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101   :    *Starship* http://starship.skyport.net
10553 Berlin                 :     PGP key -> http://pgp.ai.mit.edu/
     we're tired of banana software - shipped green, ripens at home