An XML parser is an XML parser. Period.

Uche Ogbuji uche at ogbuji.net
Thu Feb 12 09:29:19 EST 2004


Peter Hansen:

> Hmm... makes me want to check their web site, to see what this is really
> about:
> 
> '''RXP is a very fast validating XML parser written by Richard Tobin of 
> the University of Edinburgh. It complies fully with the W3C test suites 
> (although we have compiled it without Unicode support for the time being). 
> We would like to thank Richard Tobin and Henry Thompson of the Language 
> Technology Group for making this code available to the world.
> '''
> 
> Seems pretty self-explanatory to me.  Might even be why, when I downloaded
> and tried to use it (and got good results) a year or two ago, I had no
> qualms about using it.  Clearly stated, and to the point, except that one
> is left to make the small connection between "compiled without Unicode
> support" and "doesn't handle character entities".  (Or is it that it 
> handles character entities, but not those beyond 127?  Probably moot.)
> 
> Doesn't this imply that anyone, at any time, could choose to recompile
> *with* Unicode support, which is presumably _in place_ but just optionally
> left out of the standard distribution?
> 
> So it's neither a bug, nor a design decision, but a packaging choice.
> 
> I think I'm back to saying that "not an XML parser!!!!" is a bit of an
> unfair reaction, given how open they are about the situation.

*sigh*.  I don't know how many more times and ways I can say this.  One
more time and I'm done unless a new, salient point comes up.

There *is* a packaging of PyRXP that is XML compliant.  It's called
PyRXPU.  It is precisely PyRXP compiled with Unicode support, plus
output of Unicode objects in the resulting data structure (which is
what I recommend for XML processing).

So once more: AFAICT PyRXPU is an XML parser.  PyRXP is certainly not
an XML parser.  The substrate RXP is not an XML parser either when
compiled without Unicode support.  Although I respect Thompson and
Tobin as much as I do the PyRXP developers, they were really confusing
themselves and others when they said "It complies fully with the W3C
test suites (although we have compiled it without Unicode support for
the time being)."
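
To make the conformance point concrete, here is a minimal sketch of the
kind of input at issue: a numeric character reference beyond 127.  It
uses the standard library's xml.dom.minidom purely as a stand-in for a
conformant parser; it does not exercise PyRXP or PyRXPU, and the helper
name text_of_root is made up for this illustration.

```python
from xml.dom import minidom

# A tiny document whose text uses a numeric character reference
# beyond ASCII: &#8364; is U+20AC, the euro sign.
DOC = "<price>&#8364;10</price>"


def text_of_root(xml_text):
    """Parse xml_text and return the text content of the root element.

    Hypothetical helper for this illustration only.
    """
    dom = minidom.parseString(xml_text)
    root = dom.documentElement
    # Join all direct text children in case the parser splits the text.
    return "".join(
        node.data for node in root.childNodes
        if node.nodeType == node.TEXT_NODE
    )


result = text_of_root(DOC)
# A conformant parser hands back the euro sign as a real Unicode
# character, followed by the literal "10".
print(repr(result))
```

A parser that cannot resolve that reference to U+20AC is not doing
what XML 1.0 requires, however fast it is on ASCII-only input.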

Several times early on, when this issue was brought up, the PyRXP
developers in effect said: We need it to be fast, so we won't be doing
anything to make it conformant because we know doing so would slow it
down.  This is a pretty poisonous attitude when claiming to support a
standard, and what makes this even worse is that the PyRXP Web page
starts out saying:

"...PyRXP...the fastest validating XML parser available for Python,
and quite possibly anywhere :-)."

And then goes on to justify that statement with a "benchmark" of PyRXP
against other XML parsers without mentioning the inconvenient fact
that PyRXP is *not* an XML parser, and that building it so that it is
would drop it in the benchmarks somewhat.  (Not that I know who should
really care because unless you're using 4DOM or minidom all the
options are in the same order of magnitude: if you want to wring out
the last odd drop of CPU--and you probably don't need to--then you
should be using neither XML nor Python).

Are you seriously telling me that in the face of all this, my
criticism, strongly worded as it is, is unfair?

My main aim here is to make it well known that PyRXP is not an XML
parser.  It won't trouble me if people continue to use it as currently
packaged.  I just want to make sure they know they are not using what
they may think they are.

Once again: PyRXPU (contributed, tellingly, by someone outside the
PyRXP core team) is the right build of PyRXP if you need an XML
parser.  The bad news is that it's only available from ReportLab CVS. 
My article is now out and includes details for obtaining PyRXPU:

http://www.xml.com/pub/a/2004/02/11/py-xml.html

--Uche
http://uche.ogbuji.net


