[XML-SIG] Character classes

Uche Ogbuji uche.ogbuji@fourthought.com
Fri, 11 Jan 2002 18:15:45 -0700


> Appendix B of the XML REC, at
> http://www.w3.org/TR/2000/REC-xml-20001006#CharClasses, specifies the
> Unicode characters that are allowed in element names.  It doesn't look
> like anything in the PyXML package actually implements them, though.
> For example, I've just run into this with 4DOM, where Document.py
> contains:
> 
> #FIXME: should allow combining characters: fix when Python gets Unicode
> g_namePattern = re.compile('[a-zA-Z_:][\w\.\-_:]*\Z')

Oh, hey.  I wrote that comment, must have been 2 years ago.

I guess it's time to put in a fix, eh?  If someone files a bug report, I can
try to have a look after the 4Suite 1.12.0 release.

> One of the RELAX NG test cases is a document with a high-Unicode tag
> name, and so that's why I'm hitting this.  
> 
> Document.py would need to be changed, but so would xmlproc and
> doubtless other pieces of code.  Therefore, there should be a separate
> module containing character info that both 4DOM and xmlproc could use.
> (xml/chars.py?)  But what should chars.py contain?  Strings? (BaseChar
> = "\u0041\u0042...")  Lists of legal characters?  (BaseChar = [0x41,
> 0x42, ...])  Something else?

It should contain full character tables: we might as well be fully correct.
I know where in expat we might snag such tables.
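
To make the table idea concrete, here is a minimal sketch of what a shared
module could look like.  Everything named here (the constants, char_class,
name_pattern) is hypothetical, the tables hold only the first few ranges of
each Appendix B class rather than the full lists, and it is written for
current Python 3, not the Python of this thread.

```python
import re

# Excerpt only: the first few (low, high) ranges of each Appendix B class.
# A real module would carry the complete lists from the XML REC.
BASE_CHAR = [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x00D6),
             (0x00D8, 0x00F6), (0x00F8, 0x00FF)]
IDEOGRAPHIC = [(0x3007, 0x3007), (0x3021, 0x3029), (0x4E00, 0x9FA5)]
DIGIT = [(0x0030, 0x0039), (0x0660, 0x0669)]
COMBINING_CHAR = [(0x0300, 0x0345)]
EXTENDER = [(0x00B7, 0x00B7), (0x02D0, 0x02D1)]


def char_class(ranges):
    """Render (low, high) code point pairs as the body of a regex class."""
    parts = []
    for low, high in ranges:
        if low == high:
            parts.append("\\u%04X" % low)
        else:
            parts.append("\\u%04X-\\u%04X" % (low, high))
    return "".join(parts)


# Letter ::= BaseChar | Ideographic
LETTER = BASE_CHAR + IDEOGRAPHIC

# Name ::= (Letter | '_' | ':') (NameChar)*
NAME_START = "[" + char_class(LETTER) + "_:]"
NAME_CHAR = ("[" + char_class(LETTER + DIGIT + COMBINING_CHAR + EXTENDER)
             + r"._:\-]")

name_pattern = re.compile(NAME_START + NAME_CHAR + r"*\Z")
```

Keeping the ranges as plain data means 4DOM could build a regex from them
as above, while xmlproc or a C extension could consume the same pairs in
whatever form suits them.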


> Appendix B of the XML REC derives the character classes from the
> Unicode 2.0 character database.  Should we just write out all the
> expressions from Appendix B as regex patterns, or derive them from the
> database?  Note that Python comes with Unicode 3.0, so maybe we can't
> use the database at all!

Hmm.  My head might be foggy, but I think Unicode 3.0 vs 2.0 will only affect
character data, not namechars (at least until XML 1.1, AKA Blueberry).


> Also, there doesn't seem to be a C-level API for querying the Unicode
> database, which means there's no easy way to fix sgmlop.c.

I guess we should just define an API for the lookup tables in C.  That would 
make it even easier to raid expat for the goods.
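
As a rough illustration of the kind of lookup table such an API could sit
on top of, here is a two-level, page-based bitmap sketched in Python so the
logic is easy to check before porting it to C.  The names (build_tables,
lookup, PAGE_BITS) are hypothetical, the layout is my assumption about a
reasonable shape and is not copied from expat, and it covers only the Basic
Multilingual Plane.

```python
PAGE_BITS = 8                 # 256 code points per page
PAGE_SIZE = 1 << PAGE_BITS


def build_tables(ranges, max_code=0xFFFF):
    """Return (page_index, bitmaps) built from (low, high) code point pairs.

    page_index maps each page number to a slot in bitmaps; identical pages
    (for example all-empty or all-full ones) share a single bitmap.
    """
    n_pages = (max_code + 1) >> PAGE_BITS
    raw = [0] * n_pages
    for low, high in ranges:
        for code in range(low, high + 1):
            raw[code >> PAGE_BITS] |= 1 << (code & (PAGE_SIZE - 1))

    bitmaps = []
    page_index = []
    seen = {}
    for bits in raw:
        if bits not in seen:
            seen[bits] = len(bitmaps)
            bitmaps.append(bits)
        page_index.append(seen[bits])
    return page_index, bitmaps


def lookup(tables, code):
    """Return True if the code point is a member of the class."""
    page_index, bitmaps = tables
    page = code >> PAGE_BITS
    if page >= len(page_index):
        return False          # outside the range the tables cover
    bits = bitmaps[page_index[page]]
    return bool((bits >> (code & (PAGE_SIZE - 1))) & 1)
```

In C the same idea would be one array of page numbers plus an array of
fixed-size bitmaps, so a membership test costs two index operations and a
mask.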


-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Boulder, CO 80301-2537, USA
XML strategy, XML tools (http://4Suite.org), knowledge management