[XML-SIG] Character classes
Uche Ogbuji
uche.ogbuji@fourthought.com
Fri, 11 Jan 2002 18:15:45 -0700
> Appendix B of the XML REC, at
> http://www.w3.org/TR/2000/REC-xml-20001006#CharClasses, specifies the
> Unicode characters that are allowed in element names. It doesn't look
> like anything in the PyXML package actually implements them, though.
> For example, I've just run into this with 4DOM, where Document.py
> contains:
>
> #FIXME: should allow combining characters: fix when Python gets Unicode
> g_namePattern = re.compile('[a-zA-Z_:][\w\.\-_:]*\Z')
Oh, hey. I wrote that comment; it must have been two years ago.
I guess it's time to put in a fix, eh? If someone files a bug report, I can
try to have a look after the 4Suite 0.12.0 release.
> One of the RELAX NG test cases is a document with a high-Unicode tag
> name, and so that's why I'm hitting this.
>
> Document.py would need to be changed, but so would xmlproc and
> doubtless other pieces of code. Therefore, there should be a separate
> module containing character info that both 4DOM and xmlproc could use.
> (xml/chars.py?) But what should chars.py contain? Strings? (BaseChar
> = "\u0041\u0042...") Lists of legal characters? (BaseChar = [0x41,
> 0x42, ...]) Something else?
It should contain full character tables: we might as well be fully correct.
I know where in expat we might snag such tables.
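
To make the "full character tables" idea concrete, here is a rough sketch of what a shared module might look like, written for a current Python 3 rather than the Python of this thread. Everything in it is an assumption for illustration: the module layout, the names BASE_CHAR, IDEOGRAPHIC, DIGIT, in_table, is_name_start, is_name_char and is_name are all hypothetical, and the tables hold only the first few ranges of each Appendix B production, not the whole thing. CombiningChar and Extender are left out entirely. Each table is a sorted list of inclusive (first, last) code point ranges, searched with bisect.

```python
from bisect import bisect_right

# EXCERPTS ONLY: the real Appendix B productions are much longer.
# Each table is a sorted list of inclusive (first, last) code point ranges.
BASE_CHAR = [
    (0x0041, 0x005A),
    (0x0061, 0x007A),
    (0x00C0, 0x00D6),
    (0x00D8, 0x00F6),
    (0x00F8, 0x00FF),
]
IDEOGRAPHIC = [
    (0x3007, 0x3007),
    (0x3021, 0x3029),
    (0x4E00, 0x9FA5),
]
DIGIT = [
    (0x0030, 0x0039),
]


def in_table(table, ch):
    """Return True if the single character ch falls in one of the ranges."""
    cp = ord(ch)
    # Index of the last range whose first code point is <= cp.
    i = bisect_right(table, (cp, 0x10FFFF)) - 1
    return i >= 0 and table[i][0] <= cp <= table[i][1]


def is_name_start(ch):
    # Name ::= (Letter | '_' | ':') (NameChar)*
    return ch in "_:" or in_table(BASE_CHAR, ch) or in_table(IDEOGRAPHIC, ch)


def is_name_char(ch):
    # CombiningChar and Extender are omitted from this sketch.
    return ch in ".-" or is_name_start(ch) or in_table(DIGIT, ch)


def is_name(s):
    """Check s against the (excerpted) XML 1.0 Name production."""
    return bool(s) and is_name_start(s[0]) and all(is_name_char(c) for c in s[1:])
```

With tables like these, is_name("a-b.c") and is_name("\u4e2d\u6587") both pass, while is_name("1abc") fails because a digit cannot start a name. Range pairs keep the module small and map directly onto the way Appendix B is written; the same data could also be turned into a regex character class if a compiled pattern is preferred.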
> Appendix B of the XML REC derives the character classes from the
> Unicode 2.0 character database. Should we just write out all the
> expressions from Appendix B as regex patterns, or derive them from the
> database? Note that Python comes with Unicode 3.0, so maybe we can't
> use the database at all!
Hmm. My head might be foggy, but I think Unicode 3.0 vs 2.1 will only affect
character data, not namechars (at least until XML 1.1, AKA Blueberry).
> Also, there doesn't seem to be a C-level API for querying the Unicode
> database, which means there's no easy way to fix sgmlop.c.
I guess we should just define an API for the lookup tables in C. That would
make it even easier to raid expat for the goods.
--
Uche Ogbuji Principal Consultant
uche.ogbuji@fourthought.com +1 303 583 9900 x 101
Fourthought, Inc. http://Fourthought.com
4735 East Walnut St, Boulder, CO 80301-2537, USA
XML strategy, XML tools (http://4Suite.org), knowledge management