[XML-SIG] parsing xml schema

Alan Kennedy pyxml@xhaus.com
Fri, 23 Nov 2001 17:00:31 +0000


Uche Ogbuji wrote:

> Don't hesitate to ask if you need help with this task.  In fact, if you were
> able to write up what you did to use PyTREX as a validator I would love to
> make this available to others.

OK, here is how I see it.

Basically, I need to validate XML data files. These may come either as textual
XML data submitted to the application, or as DOM structures retrieved from some
storage (content repos, pickled DOM in an RDBMS?). The DOM structures are very
likely to be either pDomlette, or 4Suite 0.12 R/W cDomlettes (he said hopefully ;-)

Another, perhaps more esoteric, case is where the TREX pattern is itself stored in
a DOM, having perhaps been generated from an XSLT transform, although off-hand I
can't picture any use cases for such a scenario.

It is very likely that I will need a persistable "compiled" version of the TREX
pattern, since I will have a set of 10 to 100 hand-written TREX patterns that will
be used continually, and I don't want to re-parse them each time. It is quite
likely I could just pickle the pattern after parsing, but that remains to be
verified.
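The pickling idea is just the standard round-trip; here is a minimal sketch, using
a stand-in class rather than the real pyTrex pattern object (which is exactly the
part that remains to be verified, since it may hold unpicklable references):

```python
import pickle

# Stand-in for a compiled TREX pattern; the real pyTrex object may or
# may not survive pickling, e.g. if it keeps parser state internally.
class CompiledPattern:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

pattern = CompiledPattern("addressBook", [CompiledPattern("card")])

# Persist once at "compile" time...
data = pickle.dumps(pattern)

# ...and reload cheaply on each validation run instead of re-parsing.
restored = pickle.loads(data)
assert restored.name == "addressBook"
assert restored.children[0].name == "card"
```

If the real pattern objects don't pickle cleanly, a thin __getstate__/__setstate__
pair on the offending classes would probably be enough.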

Validating textual XML is simple: create a pyTrex instance from the textual XML
using the "parse_Instance" function, create a TREX pattern from the textual TREX
file using the "parse_Trex" function, and then use the "validate" function to
match the former against the latter.

However, things are more complex when it comes to DOMs, mainly because pyTrex uses
non-SAX/DOM interfaces in order to speed things up as much as possible.
Efficiently integrating with [cp]Domlette is non-trivial, for the following
reasons.

1. pyTrex uses the pyExpat (non-SAX) callback interfaces directly, presumably to
increase speed.

2. pyTrex uses its own internal non-DOM object model to store the document and
schema representations, again presumably for speed. This is a good design choice,
since pyTrex does not need the wealth of DOM pointers (sibling, parent, etc) to do
its job: it just needs one-way, down-pointing parent to child relationships.
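For illustration, a lightweight object model of the kind described might look like
the following. The class and attribute names are mine, not pyTrex's actual
internals; the point is that each node keeps only downward parent-to-child links,
with no sibling or parent pointers:

```python
# A minimal one-way tree: each node holds only its children, so the
# whole structure can be much lighter than a full DOM.
class Element:
    __slots__ = ("name", "attrs", "children")  # keep instances small

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}
        self.children = []  # child Elements and text strings, in order

doc = Element("addressBook")
card = Element("card", {"id": "1"})
card.children.append("Alan Kennedy")
doc.children.append(card)
```

Pattern matching over such a tree only ever descends, which is all TREX-style
validation needs.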

The way I see it, there are four possible approaches I can take to validate a DOM
structure.

1. Serialise the DOM to a string, and let pyTrex re-parse the string to build up
its own data structures. Advantages: minimal extra coding. Disadvantages: 1. Speed
inefficiency, since the XML is parsed twice. 2. Memory inefficiency, since both
the DOM structure and the pyTrex object model will be present in memory at the
same time. 3. Can only use pyExpat as a parser.
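Option 1 in miniature, using stdlib minidom as a stand-in for a Domlette tree and
a second minidom parse as a stand-in for pyTrex's parse_Instance:

```python
from xml.dom import minidom

# An existing DOM, as it might come back from storage...
dom = minidom.parseString("<addressBook><card id='1'/></addressBook>")

# ...serialised back to text, then parsed a second time. Both trees
# now sit in memory at once, which is the inefficiency noted above.
text = dom.documentElement.toxml()
reparsed = minidom.parseString(text)
```

The coding cost really is near zero; the price is paid entirely in CPU and memory.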

2. Rewrite the pyTrex HandlerBase class to take its event calls from a SAX(2)
stream. Advantages: 1. Can use any SAX-compliant parser. 2. Eliminates the
double-parsing problem, since a SAX stream can be generated by tree-walking an
existing DOM. Disadvantages: memory inefficiency, since both the DOM structure and
the pyTrex object model will be present in memory at the same time.
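The tree-walking half of option 2 is straightforward to sketch. Here the handler
just records events; a reworked pyTrex HandlerBase would sit in its place (the
Recorder class and walk function are illustrative names, not pyTrex API):

```python
from xml import sax
from xml.dom import minidom

# A trivial SAX handler that records the events it receives.
class Recorder(sax.ContentHandler):
    def __init__(self):
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        self.events.append(("chars", content))

    def endElement(self, name):
        self.events.append(("end", name))

# Generate a SAX stream by walking an existing DOM: no second text parse.
def walk(node, handler):
    if node.nodeType == node.ELEMENT_NODE:
        handler.startElement(node.tagName, dict(node.attributes.items()))
        for child in node.childNodes:
            walk(child, handler)
        handler.endElement(node.tagName)
    elif node.nodeType == node.TEXT_NODE:
        handler.characters(node.data)

dom = minidom.parseString("<card><name>Alan</name></card>")
rec = Recorder()
walk(dom.documentElement, rec)
```

The same handler could equally be fed by any SAX-compliant parser, which is the
other advantage noted above.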

3. Rewrite the pyTrex "parse_Instance" function and dependent classes so that it
augments an existing DOM structure with whatever attributes it needs. Advantages:
memory efficiency, since both parallel object models are stored in the same
structure. Disadvantages: a fair amount of code rewriting, and thus debugging
(which I don't fancy much).

4. Rewrite the pyTrex pattern matching routines so that they operate off a DOM
structure instead of the proprietary pyTrex structure. Advantages: best possible
memory efficiency. Disadvantages: A *lot* of code rewriting, and consequent
debugging. I don't think I fancy opening that little can of worms.

As things stand for me now, I think I am very likely to opt for option 1, since it
involves the minimum amount of coding.

However, I may yet go for option 2, if the overhead of parsing my data files
(which will all be < 100K text) turns out to be large. Or it may turn out to be
the case (not sure yet, still spec'ing requirements) that I won't bother with the
second (DOM) parse if the first (validation) parse fails, since a failed
validation means I'm not interested in the file anyway. Thinking about it some
more, this is very likely to be the case, since I really only need the TREX
validation to act as a gate-keeper against bad data files coming into my system.
Once it's in the system, I shouldn't need to validate it again.

I don't think the memory inefficiency inherent in options 1 and 2 is large, since
the pyTrex data structures are so light.

I can't see myself going for options 3 or 4, since that would involve rewriting of
pretty complex code, a place where I can't afford to go right now.

One last requirement that I don't have now, but could foresee myself having in
the future: validating the output of an (XSLT) transformation, to be absolutely
certain that the transform is generating valid output. This is really a system
testing requirement, used only in system development and QA phases, unlikely to be
required at run time. The reason I mention this is that our last contract was
a QA contract where we were testing a system written in ASP. A lot of crud came
out of that system, and it handled multi-browser considerations particularly
badly. Some form of automated validation of the output HTML/etc might have
prevented a lot of wild-goose chasing.

Regards,

Alan.