[XML-SIG] Performance question

Henry S. Thompson ht@cogsci.ed.ac.uk
06 Nov 2002 13:59:16 +0000


"Fred L. Drake, Jr." <fdrake@acm.org> writes:

> Henry S. Thompson writes:
>  > If you want _another_ factor of 10, go to PyLTXML.  The report below
>  > is from Python 2.2.1 on RedHat Linux 7.2 using PyXML 0.8.1 and
>  > PyLTXML-1.3-2.
> 
> Wow!  That's fast!
> 
>  > I used Fred's driver, added two new functions to test bit-level and
>  > tree-level access via PyLTXML.
>  > 
>  > parser performance test
>  > 100 parses took 3.88 seconds, or 0.04 seconds/parse
>  > 100 parses took 0.25 seconds, or 0.00 seconds/parse
>  > 100 parses took 0.02 seconds, or 0.00 seconds/parse
>  > 100 parses took 0.03 seconds, or 0.00 seconds/parse
>  > 
>  > The first measurement is the original 4DOM DOM builder, the second is
>  > the expatbuilder, the third is PyLTXML returning the whole tree, the
>  > fourth is PyLTXML returning every bit (start tag, end tag, text).  I
>  > guess the tree is faster because it's slightly lazy wrt Python
>  > structures, i.e. only the root is in Python form as returned, the rest
>  > gets converted from the native C structs as you walk the Python tree.
> 
> So is the resulting object compliant (or at least close) to the Python
> DOM, as defined in the Python Library Reference?
> 
>     http://www.python.org/doc/current/lib/module-xml.dom.html

Close.

> (Lazy building of structures is fine, of course, since that's
> implementation.)  If it doesn't support the DOM API, does it support
> something with an equivalent model and functionality?

I believe so -- our model actually _predates_ the DOM, and we've never
had the time/resources to roll it forward, but it was of course
solving the same problem.
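
The lazy building mentioned in the quoted text can be illustrated with a
minimal sketch. This is not PyLTXML's code: the class name LazyNode, the
(label, children) tuples standing in for native C structs, and the
conversion counter are all hypothetical, and the sketch is written for
current Python 3 rather than Python 2.2.1.

```python
class LazyNode:
    """Hypothetical wrapper: child records become Python objects on first access."""

    conversions = 0  # counts how many raw records have been wrapped so far

    def __init__(self, raw):
        # raw stands in for a native struct: (label, list of raw child records)
        self._raw = raw
        self._children = None  # not converted yet
        LazyNode.conversions += 1

    @property
    def label(self):
        return self._raw[0]

    @property
    def children(self):
        # Wrap the children only when somebody asks, then cache the result.
        if self._children is None:
            self._children = tuple(LazyNode(c) for c in self._raw[1])
        return self._children


# Only the root is wrapped when the "parse result" is handed back.
raw_tree = ("doc", [("a", [("b", [])]), ("c", [])])
root = LazyNode(raw_tree)
```

Walking root.children wraps one more level each time, so a caller that
never descends into the tree never pays for converting it.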

The documentation lists the following Python object types:

  FileType
  DoctypeType
  ElementTypeType
  ContentParticleType
  AttrDefnType
  BitType
  ItemType
  OOBType
  ERefType
  QueryType

These correspond to the xml.dom objects as follows, I think:

FileType       * 13.6.2.1 DOMImplementation Objects
ItemType       * 13.6.2.2 Node Objects
Python tuple   * 13.6.2.3 NodeList Objects
DoctypeType    * 13.6.2.4 DocumentType Objects
FileType       * 13.6.2.5 Document Objects
ItemType       * 13.6.2.6 Element Objects
not exposed    * 13.6.2.7 Attr Objects
not exposed    * 13.6.2.8 NamedNodeMap Objects
OOBType        * 13.6.2.9 Comment Objects
ItemType       * 13.6.2.10 Text and CDATASection Objects
OOBType        * 13.6.2.11 ProcessingInstruction Objects 
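
For anyone wanting to consult the correspondence from code, it can be
transcribed as a lookup table. This is only a sketch of the tentative
mapping above: the names DOM_TO_PYLTXML, pyltxml_type_for and
unexposed_kinds are hypothetical, and the values are type names as
strings, not objects imported from PyLTXML.

```python
# Keys: xml.dom object kinds from the Python Library Reference.
# Values: the PyLTXML type name listed above, or None where not exposed.
DOM_TO_PYLTXML = {
    "DOMImplementation": "FileType",
    "Node": "ItemType",
    "NodeList": "tuple",  # a plain Python tuple
    "DocumentType": "DoctypeType",
    "Document": "FileType",
    "Element": "ItemType",
    "Attr": None,
    "NamedNodeMap": None,
    "Comment": "OOBType",
    "Text": "ItemType",
    "CDATASection": "ItemType",
    "ProcessingInstruction": "OOBType",
}


def pyltxml_type_for(dom_kind):
    """Return the PyLTXML type name for an xml.dom kind, or None if not exposed."""
    return DOM_TO_PYLTXML[dom_kind]


def unexposed_kinds():
    """List the xml.dom kinds that have no exposed PyLTXML counterpart."""
    return sorted(kind for kind, name in DOM_TO_PYLTXML.items() if name is None)
```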

The details are in the documentation that comes with the source
distribution.  The distribution uses distutils and is available,
GPL-click-wrapped, at

   http://www.ltg.ed.ac.uk/software/xml/

To avoid hassle, you'll want the source and the appropriate binary
distribution at a minimum -- actually _building_ the extension
requires an LT XML installation as well.
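
To reproduce timings in the style of the report quoted at the top, a
driver along the following lines should do. It is a sketch, not Fred's
driver (which is not shown in this message): it times the standard
library's xml.dom.minidom as a stand-in, since neither 4DOM nor PyLTXML
is assumed to be installed, and the names SAMPLE, time_parses and report
are hypothetical. It targets current Python 3 (time.perf_counter).

```python
import time
from xml.dom import minidom

# Small synthetic document; any XML string would do.
SAMPLE = "<doc>" + "<item id='1'>text</item>" * 200 + "</doc>"


def time_parses(parse, data, count=100):
    """Call parse(data) count times; return (total seconds, seconds per parse)."""
    start = time.perf_counter()
    for _ in range(count):
        parse(data)
    elapsed = time.perf_counter() - start
    return elapsed, elapsed / count


def report(parse, data, count=100):
    """Print and return one line in the style of the quoted report."""
    elapsed, per_parse = time_parses(parse, data, count)
    # Four decimals for the per-parse figure, so fast parsers do not
    # all round to 0.00.
    line = "%d parses took %.2f seconds, or %.4f seconds/parse" % (
        count, elapsed, per_parse)
    print(line)
    return line


if __name__ == "__main__":
    print("parser performance test")
    report(minidom.parseString, SAMPLE)
```

Any other parser can be timed by passing a different callable as the
first argument to report.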

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2002, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/
 [mail really from me _always_ has this .sig -- mail without it is forged spam]