[XML-SIG] Character classes

M.-A. Lemburg mal@lemburg.com
Sat, 12 Jan 2002 14:33:08 +0100


Martin v. Loewis wrote:

>>The Unicode 3.0 database is mostly backward compatible with Unicode 2.0,
>>except for a few well-documented changes. I don't think we should care
>>about those...
>>
> 
> For strict XML conformance, one may want to worry; see
> xml/xmlchargen.


The XML spec isn't tied to one specific Unicode
version; Unicode 3 is mentioned in the spec as well:

	http://www.w3.org/TR/REC-xml

OTOH, Letter is defined explicitly without reference to the
Unicode database:

	http://www.w3.org/TR/REC-xml#NT-Letter

> Also, it isn't easy to construct the XML character
> classes with just the Python Unicode properties. For example, Python's
> .isalpha() mostly matches XML's BaseChar class, except for the Roman
> numerals, and the ESTIMATED SYMBOL, which got recategorized in 3.0.
> 
> For NameChar, the Python Unicode support does not offer anything
> close. The regular expression \w is a strict superset: it matches
> many characters that are not NameChars (e.g. SUPERSCRIPT TWO).
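
The mismatches described above are easy to check from the interpreter.
A minimal sketch, assuming a current Python 3 where str methods and \w
are Unicode-aware; the choice of U+2180 as the Roman numeral sample and
the helper name report() are mine, and U+00B2 SUPERSCRIPT TWO is taken
to be the \w example meant above:

```python
import re
import unicodedata

# Sample characters: a Roman numeral and ESTIMATED SYMBOL (both listed in
# XML 1.0 BaseChar), plus SUPERSCRIPT TWO (not an XML 1.0 NameChar).
SAMPLES = ["\u2180", "\u212e", "\u00b2"]


def report(ch):
    """Return (name, general category, isalpha(), matches \\w) for ch."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        ch.isalpha(),
        re.match(r"\w", ch) is not None,
    )


if __name__ == "__main__":
    for ch in SAMPLES:
        print("U+%04X" % ord(ch), report(ch))
```

The first two samples come back with isalpha() false even though
BaseChar lists them, and the last one matches \w without being a
NameChar.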


In that case, I suppose you ought to simply create a database
similar to the one used by unicodectype.c, which uses the explicit
character ranges defined in the XML spec as reference and
provides an API for querying isLetter(), isBaseChar() etc.

Tools/unicode/makeunicodedata.py has the needed tools to
generate such a table, so this shouldn't be too complicated.
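
To make the idea concrete, here is a rough pure-Python sketch of such a
range table with a lookup API. It is not generated by
makeunicodedata.py and not how unicodectype.c stores its data; the
names make_predicate, BASECHAR_RANGES and IDEOGRAPHIC_RANGES are made
up for illustration, and the BaseChar list is deliberately cut down to
the first few ranges of the XML 1.0 production:

```python
from bisect import bisect_right

# Abbreviated on purpose: only the first few ranges of the XML 1.0
# BaseChar production. A real table would list all of them.
BASECHAR_RANGES = [
    (0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x00D6),
    (0x00D8, 0x00F6), (0x00F8, 0x00FF),
]

# The Ideographic production from XML 1.0.
IDEOGRAPHIC_RANGES = [(0x3007, 0x3007), (0x3021, 0x3029), (0x4E00, 0x9FA5)]


def make_predicate(ranges):
    """Build a membership test over inclusive, non-overlapping ranges."""
    ranges = sorted(ranges)
    starts = [lo for lo, _ in ranges]

    def predicate(ch):
        code = ord(ch)
        # Find the last range whose start is <= code, then check its end.
        i = bisect_right(starts, code) - 1
        return i >= 0 and code <= ranges[i][1]

    return predicate


isBaseChar = make_predicate(BASECHAR_RANGES)
isIdeographic = make_predicate(IDEOGRAPHIC_RANGES)


def isLetter(ch):
    """XML 1.0 defines Letter as BaseChar | Ideographic."""
    return isBaseChar(ch) or isIdeographic(ch)
```

The binary search keeps lookups cheap even with the full list of
ranges; a generated C table would presumably do something more compact.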

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/