[XML-SIG] Matching NameChars

Thu, 29 Mar 2001 19:39:12 +0200

I have now committed two new modules, utils/xmlchargen.py and
xml/utils/characters.py (generated from the former). These represent
common regular expressions: specifically, expressions for the
productions in sections B and 2.3, Names and Tokens. For each of them,
there is a string constant Foo representing a regular expression, and
a compiled regular expression re_Foo.
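
To illustrate the Foo / re_Foo convention, here is a minimal sketch. It is
not the generated module: the character ranges are a small, hypothetical
ASCII/Latin-1 subset, and match_name is a helper made up for this example.
CombiningChar and Extender are left out entirely.

```python
import re

# Hypothetical, heavily simplified stand-ins. The real productions span
# many more Unicode ranges than the few listed here.
Letter = u"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]"
Digit = u"[0-9]"

# NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' (simplified)
NameChar = u"(?:%s|%s|[-._:])" % (Letter, Digit)

# Name ::= (Letter | '_' | ':') (NameChar)*
Name = u"(?:%s|[_:])%s*" % (Letter, NameChar)

# Compiled counterpart, following the re_Foo naming from the post.
re_Name = re.compile(Name)


def match_name(text, pos=0):
    """Return the Name starting at pos, or None if there is none."""
    m = re_Name.match(text, pos)
    if m is None:
        return None
    return m.group(0)
```

A parser would then call something like re_Name.match(data, pos) at the
point where it expects an element or attribute name.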

I've changed xmlproc to use those. As it turns out, this slows
down parsing of an example document (the XSLT spec) by 3%,
contrary to my earlier (more optimistic) measurements.

Marc-André suggested writing C code to speed this up. So here is a
revised challenge for any prospective contributor: write a C module
that emulates xml.utils.characters, by providing objects with the same
methods as the compiled regular expressions, but faster matching
algorithms. Alternatively, come up with a patch to sre that performs
faster matching when presented with Unicode character classes - that
would help more Python users than the former approach.

Hint: Please have a look at how expat represents the bitmaps; that
appears to be quite efficient. I'd discourage outright copying of
those tables, though - somebody should verify that they are still
correct for XML 1.0 2nd edition.
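
For anyone unfamiliar with the technique, the general idea of a paged
bitmap can be sketched in Python as below. This is only an assumption
about the overall shape, not a copy of expat's tables or its exact layout:
RANGES is an incomplete, illustrative subset of NameChar, and build_tables
and is_name_char are names invented for this example. A C module would do
the same two array lookups per character.

```python
# Illustrative subset only; NOT the full XML 1.0 NameChar production.
RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0x3A, 0x3A), (0x41, 0x5A),
    (0x5F, 0x5F), (0x61, 0x7A), (0xB7, 0xB7),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF),
]


def build_tables(ranges):
    """Build a two-level table for code points below 0x10000.

    Returns (page_index, pages): page_index maps the high byte of a code
    point to a page number, and each page is a 256-bit integer bitmap
    indexed by the low byte. Identical pages are stored only once.
    """
    bitmaps = [0] * 256
    for first, last in ranges:
        for cp in range(first, last + 1):
            bitmaps[cp >> 8] |= 1 << (cp & 0xFF)
    pages = []
    page_index = []
    for bits in bitmaps:
        if bits not in pages:
            pages.append(bits)
        page_index.append(pages.index(bits))
    return page_index, pages


PAGE_INDEX, PAGES = build_tables(RANGES)


def is_name_char(ch):
    """Test one character with two lookups and a bit test."""
    cp = ord(ch)
    if cp > 0xFFFF:
        return False  # the sketch only covers the 16-bit range
    return bool((PAGES[PAGE_INDEX[cp >> 8]] >> (cp & 0xFF)) & 1)
```

Because most high bytes share the same all-zero (or all-one) page, the
deduplicated table stays small even though it covers 65536 code points.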

Regards,
Martin