[XML-SIG] Re: Can anyone recommend a sensible XML parser for Python?

Glyph Lefkowitz glyph@twistedmatrix.com
Sun, 08 Sep 2002 20:06:53 -0500 (CDT)


On Sat, 07 Sep 2002 00:10:51 -0600, Uche Ogbuji <uche.ogbuji@fourthought.com> wrote:
> > On Fri, 06 Sep 2002 17:31:51 -0600, Uche Ogbuji <uche.ogbuji@fourthought.com> wrote:

> > I suppose I could try to wrap HtmlParser with minidom... yuck.  Gross, but
> > probably a good idea, come to think of it :)

> I can't imagine why this would be gross.

Sorry, I was saying that making sense of non-XHTML HTML is kind of gross.  I
did say that it was a good idea, and it's definitely a neat trick.
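
To make the trick concrete, here is a rough sketch using today's standard
library (html.parser and xml.dom.minidom). The DomBuilder class and the VOID
set are hypothetical names made up for illustration; the sketch assumes the
input has a single root element, and it makes no attempt to apply real HTML
nesting rules, so an unclosed <p> simply nests inside the previous one.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Elements that never take an end tag in HTML (a partial, illustrative list).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class DomBuilder(HTMLParser):
    """Hypothetical helper: build a minidom tree from HTMLParser events."""

    def __init__(self):
        super().__init__()
        self.doc = minidom.Document()
        self._stack = [self.doc]  # currently open nodes, document first

    def handle_starttag(self, tag, attrs):
        element = self.doc.createElement(tag)
        for name, value in attrs:
            # Valueless attributes arrive as None; store an empty string.
            element.setAttribute(name, value or "")
        self._stack[-1].appendChild(element)
        if tag not in VOID:
            self._stack.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; a stray end tag
        # with no matching open element is ignored.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tagName == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # A Document node cannot hold text, so drop text outside the root.
        if len(self._stack) > 1:
            self._stack[-1].appendChild(self.doc.createTextNode(data))


builder = DomBuilder()
builder.feed("<html><body><p>one<br>two<p>three</body></html>")
builder.close()
result = builder.doc.documentElement.toxml()
```

The result is well-formed, which is all minidom needs, but the nested <p> is
exactly the sort of grossness in question.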

> According to the DOM Level 2 spec: "And, cloning Document, DocumentType,
> Entity, and Notation nodes is implementation dependent."

This is why standards compliance is not terribly important to me.  I would
rather have a useful XML API than a standardized one.

> Can you expand a bit more on the actual use case that makes you think you want 
> to clone a document node?

I have a template "frame" document.  I want to clone the document, populate it
with information lifted from other XML files, and then write the resultant
(cloned) document out.  This is the very first use-case I ever had working with
XML and it is still the most common.
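
As a sketch of that use case which avoids cloning the Document node itself:
with the standard library's xml.dom.minidom, one can create a fresh Document
and import a deep copy of the template's root element into it. The
copy_document name and the sample markup are made up for illustration, and
the sketch assumes nothing outside the root element (doctype, comments,
processing instructions) needs to survive the copy.

```python
from xml.dom import minidom


def copy_document(template_doc):
    """Return a new Document holding a deep copy of the template's root."""
    new_doc = minidom.Document()
    # importNode copies the subtree and re-owns it to new_doc; the template
    # is left untouched.
    root_copy = new_doc.importNode(template_doc.documentElement, True)
    new_doc.appendChild(root_copy)
    return new_doc


# Hypothetical "frame" document with slots to be filled in.
template = minidom.parseString("<page><title/><body/></page>")

filled = copy_document(template)
title = filled.getElementsByTagName("title")[0]
title.appendChild(filled.createTextNode("Hello"))

output = filled.documentElement.toxml()
```

The same template can then be copied and populated as many times as needed
before each result is written out.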

> We choose not to allow it.  Perfectly legal, and I think this is the right 
> choice.

Yes, but the point remains that this *used* to work, and now it *doesn't*.
This is functionality I found useful.  While I can't comment on the intrinsic
sense or nonsense of cloning document nodes in DOM, I do know that it's
difficult to keep track of when features like this appear and disappear in the
various different XML solutions for Python.

Maybe this is the only feature that has done this; I don't know.  It just
happens that it's a very commonly-used one for me.

This is just another instance of my general complaint that tracking versioning
dependencies is not worth the effort for my degenerately simple use-cases for
XML.

> You mean you can't require, say PyXML 0.8.1?  Tough crowd you develop for?
> :-)

There are still some parties interested in Twisted who are upset that it
requires Python 2.1; in fact, I felt guilty doing 2.1 support because I am
likely going to have to backport portions of it to 1.5.2 for some people.  We
can all thank Red Hat for this inane persistence of ancient Python versions,
but it is sadly the world I live in.

> > My main frustration is with packaging.

> Here you have a point.  Python, PyXML, and a lot of the related packages move
> very quickly, and so quickly that they cause all manner of packaging
> problems.

This is my main point, and this is the one that the PyXML community can do the
least to address.  Buggy and idiosyncratic implementations are already in the
wild, and some apps will depend on those particular bugs and idiosyncrasies.
If Twisted depends on a new or different set of bugs and quirks, I make it
incompatible with whatever other XML-using applications are out there today.

Given that XML is an integration technology this is certainly less than
desirable.

> There is no easy solution to this.

Having a project that is precipitously approaching 1.0 myself, I can
sympathize.  As much as this sort of dependency and compatibility problem has
bothered me, I *know* there will be people that write apps for Twisted and will
curse my name when I enhance some functionality later on :-).

> I have had it in mind to suggest a PyXML-in-a-tie type effort in the Python
> Business Forum once the effort on Python itself starts to gain legs.  I guess
> I can count on you to at least help cheerlead?  :-)

Cheerleading, certainly :-).  Although I'm less interested in seeing PyXML
prepared for "business" clients and more interested in just seeing the level of
QA on the volunteer work go up.  If I *had* any spare "scarce resources" to
commit beyond my own projects, I would certainly help getting the unit tests
unified and automated.

> > or produce what amounts to my own `implementation' of an XML parser.
> 
> If you try going this route, I guarantee you'll still be trying to get the 
> most basic things right six months from now.

...

> > For the applications that I'm intending to write, just doing my own parser and
> > API is both more appealing and more rewarding.
> 
> Really?  Color me deeply skeptical.  I have not seen an application on earth 
> where implementing one's own parser is a good idea, and precious few where 
> implementing one's own API is a good idea.  I have a lot of colleagues who 
> have tried.

While it is *possible* that I'm smarter than you think I am, it is certain that
I'm more stubborn.  My sophomoric attempt at an XML parser is now in Twisted
CVS.

I've had this objection raised over writing yet another web server, yet
another remote procedure call protocol, yet another asynchronous socket server
and yet another database interface.  It seems like at least some of these ideas
were good ones, so I went ahead and wrote an XML parser and representation
anyway :-).

A fellow I know from IRC once said "it's easier to write an s-expression parser
for a particular platform by hand than to learn to use any of the XML tools for
that platform".  I think that if you're interested in keeping your focus narrow
in terms of what you do with XML, the same is true of writing an XML parser.

As a data point for this hypothesis, writing the parser and the node tree took
me less than half as much time as writing these posts to various mailing lists
about XML tools (not counting this post, which has been the most
time-consuming): it took less than a quarter as much time as attempting (and
failing) to track down bugs in PyXML, not counting the time I spent trying to
figure out how to turn off undesired features in a way that would work on more
than one version.  My two main existing PyXML-using applications are already
ported to this, changing barely any of their code.

Even so, this is almost not a fair comparison because I have several months of
experience with those tools on Python 2.1, and I've read a few books on XML
already.

> > Neither DOM nor SAX will present an API which allows me to get network XML
> > events in quite the way I want, so I'm going to have to do some wrapping.

> I have learned through my own bitter experience that you do not want network
> interfaces to have *anything* to do with the lexical XML layer (or even
> Infoset).  It is best to design network interactions around *application*
> level semantics.  Basically sending around chunks of XML text is far less
> hazardous than what I think you mean.

I'm not sure what you think I mean, really, but specifically, I'm thinking
particularly of parsing and routing Jabber XML streams.  If they are designed
in a "hazardous" way then it's not my issue...  I don't think much of their
protocol design as it is, especially with regard to routing.  (As you might
guess, I think the whole idea of using XML as a network protocol is rather
strange; but Jabber in particular could have been much better done.  BEEP, for
example, I consider odd, but not broken.)
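
To illustrate what "network XML events" means here, a minimal sketch using
the standard library's xml.parsers.expat: data is fed in whatever chunks the
socket happens to deliver, and start/end events are collected as elements are
recognized. The StreamEvents class and the sample stanza are hypothetical and
not taken from Jabber or from Twisted. Exactly when an event fires relative
to each chunk can depend on the Expat version, so the sketch only inspects
the events once the stream has been closed.

```python
import xml.parsers.expat


class StreamEvents:
    """Hypothetical helper: collect element events from chunked XML input."""

    def __init__(self):
        self.events = []
        self._parser = xml.parsers.expat.ParserCreate()
        self._parser.StartElementHandler = self._start
        self._parser.EndElementHandler = self._end

    def _start(self, name, attrs):
        self.events.append(("start", name, dict(attrs)))

    def _end(self, name):
        self.events.append(("end", name))

    def feed(self, data):
        # isfinal=False: more data may follow, as on an open connection.
        self._parser.Parse(data, False)

    def close(self):
        # isfinal=True: tell the parser the stream has ended.
        self._parser.Parse("", True)


stream = StreamEvents()
# Chunk boundaries deliberately fall in the middle of tags.
for chunk in ["<stream><mess", "age to='a'>hi</mes", "sage></stream>"]:
    stream.feed(chunk)
stream.close()
```

A router would dispatch on each completed child of the root element instead
of appending to a list, but the feeding pattern stays the same.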

> > (I do wish pyRXP were event-based... it's very close, in spirit, to what I
> > want.)  If the general quality of XML parsers in Python were really high, I
> > would regard this impulse as contrary and counterproductive -- why write my
> > own library for doing this when perfectly good ones already exist and
> > are deployed all over the place?

> Well, as I said, I don't see any evidence that the quality of XML parsers in 
> Python is not high.  You pointed out one problem in cloneNode which, from what 
> I gather, was mostly because you're abusing DOM.  This had nothing to do with 
> parsing.  Are you speaking generically?

When I run my particular XML-munging tool, sometimes I get:

    NameError: global name 'PROCESSING_INSTRUCTION_NODE' is not defined

which we have discussed the reasons for here.  Slightly less often, but still with
a significant frequency (same python, same PyXML, same input), I get:

    zsh: segmentation fault       ] (doc/howto/basics)

I can't present hard evidence for this, I'm sorry, because I'm not familiar
with the internals of PyXML or expat and I can't get the bug to happen
reliably.  If I can ever boil it down to something predictable (i.e. less than
1500 lines of code and half a meg of XML to trigger it) be assured I will make
the most complete bug report I can.

> > Nevertheless, it is easier to write my own XML parser than to even properly
> > report the bugs that I have thus far discovered.

> I find this claim ludicrous on its face.  Writing an XML parser with the 
> compliance level and quality of any of the ones in PyXML takes years.  Yes.  
> Years.

I never claimed to need a parser with PyXML's level of compliance; in fact,
I've said several times that compliance at that level is annoying to me because
it's too strict.

I think we're going to have to agree to disagree on "quality", but at least for
my use cases I don't get occasional coredumps from my parser.  I cannot
substantiate this with real bug reports, so please feel free to dismiss this as
FUD if you disagree.  From my discussions with other developers near my
interest area, however, QA on the PyXML project is notoriously poor, and the
quality is wildly variant from release to release.  As you yourself have said,
this is likely to remain so until someone funds improvements.

I do not feel as though I am owed anything in particular by the PyXML project
or by any subscriber to any of these lists.  In fact, I'm quite grateful for it
having provided a nice, simple introduction to the world of XML; I probably
would not be using XML today at all if it weren't for the PyXML project.
Unfortunately, due to my larger-than-average concerns about dependencies and
ease of automating testing for my own project, I don't think that PyXML is a
good solution.  I need a *very* small XML library, with no strings attached.
PyXML is huge, and featureful, and I'm sure in the most recent incarnations
it's very robust.  It does come with a lot of strings attached though.

I have decided it's not worth my time at this point to invest a lot of effort
in helping out, until a few versions go by and the general impressions I get
from XML developers I work with are becoming more positive.  This doesn't mean
I won't lend a helping hand when I can, but the communication overhead to
working in the PyXML community is not currently worth the gain I would get from
it.

I wish you the best of luck in making me look foolish for saying that :-).

-- 
 |    <`'>    |  Glyph Lefkowitz: Traveling Sorcerer   |
 |   < _/ >   |  Lead Developer,  the Twisted project  |
 |  < ___/ >  |      http://www.twistedmatrix.com      |
