[XML-SIG] Character classes

Andrew Kuchling akuchlin@mems-exchange.org
Fri, 11 Jan 2002 11:46:53 -0500


Appendix B of the XML REC, at
http://www.w3.org/TR/2000/REC-xml-20001006#CharClasses, specifies the
Unicode characters that are allowed in element names.  It doesn't look
like anything in the PyXML package actually implements them, though.
For example, I've just run into this with 4DOM, where Document.py
contains:

#FIXME: should allow combining characters: fix when Python gets Unicode
g_namePattern = re.compile('[a-zA-Z_:][\w\.\-_:]*\Z')

One of the RELAX NG test cases is a document with a high-Unicode tag
name, which is why I'm hitting this.

Document.py would need to be changed, but so would xmlproc and
doubtless other pieces of code.  Therefore, there should be a separate
module containing character info that both 4DOM and xmlproc could use.
(xml/chars.py?)  But what should chars.py contain?  Strings? (BaseChar
= "\u0041\u0042...")  Lists of legal characters?  (BaseChar = [0x41,
0x42, ...])  Something else?
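
To make the "lists" option concrete, here is a minimal sketch of what
such a module might hold: each class as a sorted list of inclusive
(low, high) code point ranges, plus a lookup helper.  The names
(IDEOGRAPHIC, BASE_CHAR_EXCERPT, in_class) are hypothetical, the
BaseChar table is only the first few ranges from Appendix B and not the
full production, and the code assumes a modern Python 3 where every
string is Unicode.

```python
# Hypothetical sketch of an xml/chars.py layout; not existing PyXML code.
from bisect import bisect_right

# Ideographic, production [86] of the XML REC, sorted by low bound.
IDEOGRAPHIC = [(0x3007, 0x3007), (0x3021, 0x3029), (0x4E00, 0x9FA5)]

# EXCERPT ONLY: the first few ranges of BaseChar, production [85].
# The real table is much longer.
BASE_CHAR_EXCERPT = [
    (0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x00D6),
    (0x00D8, 0x00F6), (0x00F8, 0x00FF),
]


def in_class(ch, ranges):
    """Return True if the single character ch falls inside one of ranges.

    ranges must be sorted by low bound and must not overlap.
    """
    cp = ord(ch)
    # Index of the last range whose low bound is <= cp (or -1 if none).
    i = bisect_right(ranges, (cp, 0x10FFFF)) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]
```

Range pairs stay compact compared with one string or list entry per
character (Ideographic alone covers roughly 21,000 code points), and
the same tables could be turned into regex character classes if a
caller prefers patterns.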

Appendix B of the XML REC derives the character classes from the
Unicode 2.0 character database.  Should we just write out all the
expressions from Appendix B as regex patterns, or derive them from the
database?  Note that Python comes with the Unicode 3.0 database, whose
character properties may differ from 2.0, so maybe we can't use the
database at all!
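
For the "write out the expressions" route, a rough sketch of
generating a Name pattern from explicit ranges might look like the
following.  It sidesteps the database version question because nothing
is looked up at run time.  The tables are deliberately incomplete
excerpts of Letter and Digit, CombiningChar and Extender are left out
entirely, and char_class and name_pattern are made-up names.  It
assumes Python 3.3 or later, where the re module understands \uXXXX
escapes in patterns.

```python
# Hypothetical sketch: build a Name regex from hand-written ranges.
import re

# EXCERPT ONLY: a few Letter ranges (BaseChar start plus Ideographic).
LETTER_EXCERPT = [
    (0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x00D6),
    (0x00D8, 0x00F6), (0x00F8, 0x00FF),
    (0x3007, 0x3007), (0x3021, 0x3029), (0x4E00, 0x9FA5),
]
# EXCERPT ONLY: ASCII digits, the first range of the Digit production.
DIGIT_EXCERPT = [(0x0030, 0x0039)]


def char_class(ranges):
    """Return the body of a regex [...] set for (low, high) ranges."""
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append("\\u%04x" % lo)
        else:
            parts.append("\\u%04x-\\u%04x" % (lo, hi))
    return "".join(parts)


# Name ::= (Letter | '_' | ':') (NameChar)*
name_start = "[" + char_class(LETTER_EXCERPT) + "_:]"
name_char = "[" + char_class(LETTER_EXCERPT + DIGIT_EXCERPT) + r"._:\-]"
name_pattern = re.compile(name_start + name_char + r"*\Z")
```

With these excerpt tables, name_pattern.match("\u4e2d\u6587") succeeds
while name_pattern.match("1abc") returns None, since a digit cannot
start a name.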

Also, there doesn't seem to be a C-level API for querying the Unicode
database, which means there's no easy way to fix sgmlop.c.

--amk