[XML-SIG] pyexpat & xmlproc : irreproducible results!

11 Nov 1999 11:13:22 +0100

* Alain Michard
|
| This time I tried to be systematic! You'll find the results below.
| To make it short, on these tests, pyexpat is always faster than
| xmlproc, with a relative difference which can vary enormously
| depending on the type of xml file and on the application programme.

I've done more or less the same tests myself, and my results seem to
agree with yours. (I think your results when using Canonizer are due
to the program spending most of its time in the DocumentHandler rather
than the parser.)

I did these tests on my home PC, a Pentium 166MHz with 80 MB of RAM,
running Windows 95. The documents used are:

 othello:       The othello.xml document from Jon Bosak's Shakespeare
                collection. No attributes. 248 K
 newlist:       An old version of the data document used for my free
                XML tools index. Heavy on markup and with quite a few
                attributes. 74 K
 teij31:        A small document marked up in the XML version of TEI 
                Lite. Some attributes, comments and CDATA sections.
                56 K
 rec-xml:       The XML specification in XML. Heavy use of attributes,
                comments, CDATA marked sections and entities. 158 K
 nt:            The New Testament, from Bosak's religious collection.
                Very simple structure, no attributes. 1 MB
 WD-xslt:       The XSLT working draft. Same kind of markup as the XML
                specification. 182 K

I would consider these results (except for #2) to be accurate to only
the two most significant digits.

Test #1, with empty DocumentHandlers

                othello newlist teij31  rec-xml nt      WD-xslt         TOTAL
sgmlop          0.053   0.022   0.016   0.038   0.172   0.038           0.341
pyexpat         2.766   0.144   0.232   0.017   4.230   1.388           8.780
xmlproc         8.935   2.399   0.950   5.188   12.00   4.056           33.53
xmlproc_val     15.65   4.387   19.07   5.309   21.14   12.00           77.57
xmllib          32.29   6.982   1.902   14.80   45.14   12.22           113.3
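
For anyone who wants to repeat this kind of measurement, here is a
minimal sketch of timing a parse with an empty handler. It is not the
script used for the numbers above: it assumes the xml.sax package from
the current Python standard library rather than the DocumentHandler
interface, and time_parse and SAMPLE are names made up for the
illustration.

```python
import time
import xml.sax
from xml.sax.handler import ContentHandler


def time_parse(data, handler, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` parses."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter()
        xml.sax.parseString(data, handler)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


# Hypothetical stand-in document; the real tests used the files listed above.
SAMPLE = b"<play>" + b"<line speaker='x'>some text</line>" * 1000 + b"</play>"

if __name__ == "__main__":
    # All ContentHandler callbacks do nothing by default, so the base
    # class serves as an empty handler.
    print("empty handler: %.4f s" % time_parse(SAMPLE, ContentHandler()))
```

Taking the best of several runs is only one way to reduce noise; the
figures in the table were not necessarily collected that way.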

Test #2, with xbel2html.py on an 862 K XBEL document

sgmlop          6.539
pyexpat         11.82
xmlproc         41.55
xmlproc_val     59.14
xmllib          80.62

Test #3, with xml.com stats collector (goes through all attributes)

                othello newlist teij31  rec-xml nt      WD-xslt       TOTAL
sgmlop          4.636   0.967   0.296   1.877   5.942   1.670         13.51*
pyexpat         5.189   1.348   0.459   -       7.354   2.461         16.83
xmlproc         11.42   2.990   1.162   5.298   15.174  5.064         35.81*
xmlproc_val     18.59   5.108   20.16   5.560   24.791  13.33         81.97*
xmllib          34.93   7.739   2.112   17.87   47.65   15.28         107.7*

*rec-xml is not included in the sum
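
To give an idea of what a handler that "goes through all attributes"
looks like, here is a rough sketch. It is not the xml.com stats
collector itself; StatsCollector and collect_stats are hypothetical
names, and the code assumes the xml.sax package from the current
standard library.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class StatsCollector(ContentHandler):
    """Count elements, attributes and character data in a document."""

    def __init__(self):
        super().__init__()
        self.elements = 0
        self.attributes = 0
        self.attr_chars = 0
        self.text_chars = 0

    def startElement(self, name, attrs):
        self.elements += 1
        # Visit every attribute so the handler does real work per element.
        for attr_name in attrs.getNames():
            self.attributes += 1
            self.attr_chars += len(attrs.getValue(attr_name))

    def characters(self, content):
        self.text_chars += len(content)


def collect_stats(data):
    """Parse `data` (bytes) and return the filled-in StatsCollector."""
    handler = StatsCollector()
    xml.sax.parseString(data, handler)
    return handler


if __name__ == "__main__":
    stats = collect_stats(b'<a x="12" y="3"><b z="q">hi</b>there</a>')
    print(stats.elements, stats.attributes, stats.attr_chars, stats.text_chars)
```

With a handler like this, the time spent outside the parser grows with
the amount of markup, which is why the gap between the parsers narrows
compared to test #1.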

This third test unearthed a small non-conformance in xmllib version
0.2: it reports character data outside the document (or root)
element. I don't know whether this still applies to the latest version
of xmllib.
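
A check for this kind of problem can be sketched as a handler that
tracks element depth and records any character data delivered while
no element is open. RootBoundaryChecker is a hypothetical name, and
the sketch assumes the xml.sax package from the current standard
library, not the xmllib driver discussed here.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class RootBoundaryChecker(ContentHandler):
    """Record character data reported while no element is open."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outside = []

    def startElement(self, name, attrs):
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1

    def characters(self, content):
        # Depth 0 means we are before or after the root element.
        if self.depth == 0:
            self.outside.append(content)


if __name__ == "__main__":
    checker = RootBoundaryChecker()
    # The newlines before and after <doc> lie outside the root element
    # and, per the point above, should not reach characters().
    xml.sax.parseString(b"<?xml version='1.0'?>\n<doc>text</doc>\n", checker)
    print("character events outside the root element:", checker.outside)
```

An empty list means the parser behind the driver kept character
events inside the root element for this input.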

Also note that the xmlproc version used here is the version currently
in my CVS tree, and this is a little faster than 0.61.

The only conclusions I can draw from this are that sgmlop is indeed
the fastest parser (and very much so if it is given no
DocumentHandler), that pyexpat is about twice as fast as xmlproc, and
that this is not affected by differences in documents or applications.
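
For anyone who wants to check the pyexpat/xmlproc ratio against the
tables, here is a small sketch that recomputes it from the TOTAL
figures (and the single figures of test #2). The numbers are copied
from above; speed_ratio is just a helper name for the illustration.
Note that the test #3 totals leave out rec-xml.

```python
# Figures copied from the tables above, as (pyexpat, xmlproc) pairs.
totals = {
    "test 1 (empty handlers)": (8.780, 33.53),
    "test 2 (xbel2html.py)": (11.82, 41.55),
    "test 3 (stats collector)": (16.83, 35.81),
}


def speed_ratio(pyexpat_time, xmlproc_time):
    """How many times longer xmlproc took than pyexpat."""
    return xmlproc_time / pyexpat_time


for name, (pyexpat_time, xmlproc_time) in totals.items():
    ratio = speed_ratio(pyexpat_time, xmlproc_time)
    print("%-26s xmlproc/pyexpat = %.2f" % (name, ratio))
```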

--Lars M.