XML Parsing

Uche Ogbuji uche at ogbuji.net
Sat Feb 21 10:34:54 EST 2004


"James Kew" <james.kew at btinternet.com> wrote in message news:<c0rkhb$1ai18a$1 at ID-71831.news.uni-berlin.de>...
> "Chris Herborth" <chrish at cryptocard.com> wrote in message
> news:5z3Yb.3920$Cd6.177500 at news20.bellglobal.com...
> >
> > PyXML on Sourceforge (http://pyxml.sourceforge.net/) has faster
> > DOM-producing routines.
> 
> Which are? I like PyXML, but well-documented it ain't. I tend to use PyXML's
> minidom, fed by either the validating (== xmlproc) or non-validating (==
> expat) parsers -- are there faster PyXML alternatives?
> 
> > pyRXP (http://www.reportlab.org/pyrxp.html) is probably the fastest XML
> > parser for Python, but it doesn't produce a DOM or have a SAX API...
> 
> And recent threads here suggest it's not fully XML-compliant either, unless
> you can work in an ASCII-only XML subset.

Yes, and this is a very serious problem.  Anyone entering into XML
processing with the belief that they'll never need anything but
Unicode characters under U+0100 (code point 256) is fooling himself.  Heck, even XML
exports from MS Office will generate high Unicode characters for
"smart" quotes, em and en dashes, ellipses and a lot of other common
punctuation.  All of these will blow up with PyRXP.

You can use PyRXPU, which is compliant but indications are that it
isn't as fast.
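
To make the character-range point concrete, here is a minimal sketch.  It
uses the standard library's minidom (expat underneath), not PyRXP, so it
only shows what a Unicode-aware parser has to cope with; it does not
reproduce the PyRXP failure.  The sample string and the helper name are
made up for illustration, and the code assumes a current Python 3.

```python
from xml.dom import minidom

# Hypothetical sample: the sort of punctuation an office-suite export might emit.
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<p>\u201cSmart\u201d quotes, an em dash \u2014 and an ellipsis\u2026</p>'
)

def chars_above_latin1(xml_text):
    """Parse xml_text and return the distinct characters above U+00FF
    found in the text directly inside the root element, sorted by code point."""
    doc = minidom.parseString(xml_text.encode("utf-8"))
    text = "".join(
        node.data
        for node in doc.documentElement.childNodes
        if node.nodeType == node.TEXT_NODE
    )
    return sorted({ch for ch in text if ord(ch) > 0xFF})

found = chars_above_latin1(SAMPLE)
print(["U+%04X" % ord(ch) for ch in found])
```

Every character it reports sits well outside the 8-bit range, which is
why a parser limited to that range cannot handle such a document.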


> For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
> glowing reviews. It's not a standard DOM API, though, and again
> documentation is a problem (lots of C-API-level documentation, but not much
> in terms of how to put it together into a working Python app).
> 
> I gave it a whirl and it certainly seemed to fly, but getting to grips with
> the API and converting my existing DOM-manipulating code to it felt like too
> much of a hurdle given that my app runs fast enough as it is.

This was my biggest problem with libxml2/Python as documented here:

http://www.xml.com/pub/a/2003/05/14/py-xml.html

If documentation for Python users is improved, it will be hard to beat
that package.

But your criteria lead me to suggest that you give cDomlette a try.  It
is also implemented in C for performance.  It's as DOM-compliant as
libxml2's DOM API (which is to say not fully so), but we do try to
document it from the Python POV.  See:

http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/domlettes

--Uche
http://uche.ogbuji.net


