[Python-Dev] New Py_UNICODE doc

M.-A. Lemburg mal at egenix.com
Mon May 9 18:19:02 CEST 2005


Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
> 
>>Unicode has many code points that are meant only for composition
>>and don't have any standalone meaning, e.g. a combining acute
>>accent (U+0301), yet they are perfectly valid code points -
>>regardless of UCS-2 or UCS-4. It is easily possible to break
>>such a combining sequence using slicing, so the most
>>often presented argument for using UCS-4 instead of UCS-2
>>(+ surrogates) is rather weak if seen by daylight.
> 
> 
> I disagree. It is not just about slicing, it is also about
> searching for a character (either through the "in" operator,
> or through regular expressions). If you define an SRE character
> class, such a character class cannot hold a non-BMP character
> in UTF-16 mode, but it can in UCS-4 mode. Consequently,
> implementing XML's lexical classes (such as Name, NCName, etc.)
> is much easier in UCS-4 than it is in UCS-2. In this case,
> combining characters do not matter much, because the XML
> spec is defined in terms of Unicode coded characters, causing
> combining characters to appear as separate entities for lexical
> purposes (unlike half surrogates).

Searching for a character is possible in UCS-2 builds as well,
even for non-BMP characters stored as surrogate pairs, now that
"in" supports searching for substrings longer than one code unit:

>>> len(u'\U00010000')
2
>>> u'\U00010000' in u'\U00010001\U00010002\U00010000 and some extra stuff'
True
>>> u'\U00010000' in u'\U00010001\U00010002\U00010003 and some extra stuff'
False
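
To make the session above reproducible outside a UCS-2 build, here is a
minimal sketch that does the same search on explicit UTF-16 code units.
The helper names utf16_units and contains_units are hypothetical and only
for illustration; this is not how the interpreter implements "in". Note
that current Python 3 releases store strings as code points, so
len(u'\U00010000') would report 1 there, which is why the sketch encodes
to UTF-16 by hand.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a tuple of ints."""
    raw = text.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(raw) // 2), raw)


def contains_units(haystack, needle):
    """Substring search on code-unit sequences, roughly what a
    UCS-2 build sees when "in" is applied to two unicode strings."""
    h, n = utf16_units(haystack), utf16_units(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))


# U+10000 is represented by a high and a low surrogate.
print([hex(u) for u in utf16_units(u"\U00010000")])  # ['0xd800', '0xdc00']
print(contains_units(u"\U00010001\U00010002\U00010000 and some extra stuff",
                     u"\U00010000"))
print(contains_units(u"\U00010001\U00010002\U00010003 and some extra stuff",
                     u"\U00010000"))
```

Because high and low surrogates come from disjoint ranges, a match on the
two-unit needle cannot start in the middle of another surrogate pair.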

On sre character classes: I don't think that these provide
a good approach to XML lexical classes - custom functions
or methods or maybe even a codec mapping the characters
to their XML lexical class are much more efficient in
practice.
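
As a rough sketch of the "custom function" idea, the following classifies
code points by a sorted range table instead of an sre character class, and
joins surrogate pairs first so that it also works on UTF-16 code units.
Everything here is an assumption made for illustration: the names
NAME_START_RANGES, is_name_start and code_points are hypothetical, and the
table is only a small subset loosely following the XML 1.1 NameStartChar
production, not a complete XML lexical class.

```python
from bisect import bisect_right

# Illustrative subset of name-start ranges (inclusive bounds, sorted).
# Assumption: NOT a complete XML table, just enough to show the lookup.
NAME_START_RANGES = [
    (0x3A, 0x3A),        # ':'
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # '_'
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6),        # some Latin-1 letters
    (0x10000, 0xEFFFF),  # non-BMP code points
]
_STARTS = [lo for lo, _ in NAME_START_RANGES]


def is_name_start(cp):
    """Binary-search the range table for the code point cp."""
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= NAME_START_RANGES[i][1]


def code_points(units):
    """Yield code points from a sequence of UTF-16 code units,
    joining each valid surrogate pair into one code point."""
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate passed through as-is.
            yield u
            i += 1


# The pair D800 DC00 is classified as the single code point U+10000.
units = [0xD800, 0xDC00, ord("a"), ord("1")]
print([is_name_start(cp) for cp in code_points(units)])
```

The lookup cost depends on the number of ranges rather than on the build's
internal string width, which is the kind of approach meant above.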

-- 
Marc-Andre Lemburg
eGenix.com
