[Python-Dev] UTF-16 code point comparison

M.-A. Lemburg mal@lemburg.com
Thu, 27 Jul 2000 12:30:53 +0200


Fredrik Lundh wrote:
> 
> mal wrote:
> > This really has nothing to do with being able to support
> > surrogates or not (as Fredrik mentioned), it is the correct
> > behaviour provided UTF-16 is used as encoding for UCS-4 values
> > in Unicode literals which is what Python currently does.
> 
> Really?  I could have sworn that most parts of Python use
> UCS-2, not UTF-16.

The design specifies that Py_UNICODE refers to UTF-16. To make
life easier, the implementation currently assumes UCS-2 in
many parts, but this should only be considered a
temporary situation. Since supporting UTF-16 poses some
real challenges (being a variable length encoding), full
support for surrogates was postponed to some future
implementation.
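
To illustrate why UTF-16 is a variable length encoding, here is a
minimal sketch in present-day Python. The helper name
to_utf16_units is hypothetical and is not part of the Unicode
implementation; it only shows how one code point maps to one or
two 16-bit code units.

```python
def to_utf16_units(code_point):
    """Return the 16-bit code units that encode one code point in UTF-16.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point < 0x10000:
        # Fits in a single 16-bit unit (this is all that UCS-2 can hold).
        return [code_point]
    # Values above U+FFFF are split into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]
```

Code that treats every Py_UNICODE as one character (len, slicing,
indexing) sees two items for any value that takes the second
branch.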

> Built-ins like ord, unichr, len; slicing;
> string methods; regular expressions, etc. all clearly assume
> that a Py_UNICODE is a unicode code point.
> 
> My point is that we shouldn't pretend we're supporting
> UTF-16 if we don't do that throughout.

We should keep that design detail in mind though.
 
> As far as I can tell, cmp() is the *only* unicode function
> that thinks the internal storage is UTF-16.
> 
> Everything else assumes UCS-2.

True.
 
> And for Python 2.0, it's surely easier to fix cmp() than to
> fix everything else.

Also true :-)
 
> (this is the same problem as 8-bit/unicode comparisons, where
> the current consensus is that it's better to raise an exception
> if it looks like the programmer doesn't know what he was doing,
> rather than pretend it's another encoding).

Perhaps you are right and we should #if 0 the comparison
sections related to UTF-16 for now. I'm not sure why Bill
needed the cmp() function to support surrogates... Bill?

Still, it will have to be reenabled sometime in the
future when full surrogate support is added to Python.
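
The difference between the two comparison orders can be sketched
as follows. This is an illustration in present-day Python (3.3 or
later, where a str holds code points), not the C code in
unicodeobject.c; the function names are made up for the example.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]

def cmp_code_units(a, b):
    """Compare as raw 16-bit units (the UCS-2 view of the storage)."""
    ua, ub = utf16_units(a), utf16_units(b)
    return (ua > ub) - (ua < ub)

def cmp_code_points(a, b):
    """Compare by code point value (the surrogate-aware view)."""
    pa, pb = [ord(c) for c in a], [ord(c) for c in b]
    return (pa > pb) - (pa < pb)

# U+E000 is a single unit; U+10000 is stored as D800 DC00.
# Unit order puts D800 before E000, code point order does not.
bmp_char = "\ue000"
astral_char = "\U00010000"
unit_result = cmp_code_units(bmp_char, astral_char)
point_result = cmp_code_points(bmp_char, astral_char)
```

The two orders only disagree when one string has a unit in the
range 0xE000-0xFFFF at the position where the other has a
surrogate; everywhere else they give the same result.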
 
> :::
> 
> To summarize, here's the "character encoding guidelines" for
> Python 2.0:
> 
>     In Unicode context, 8-bit strings contain ASCII. Characters
>     in the 0x80-0xFF range are invalid.  16-bit strings contain
>     UCS-2.  Characters in the 0xD800-0xDFFF range are invalid.

The latter is not true. In fact, thanks to Bill, the UTF-8
codec supports processing surrogates already and will output
correct UTF-8 code even for Unicode strings containing 
surrogates.
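
As a rough sketch of what surrogate processing in a UTF-8 encoder
involves (this is not the codec's actual C code, and the function
name is hypothetical): a high surrogate followed by a low
surrogate is combined into one code point, which is then written
as a four-byte sequence.

```python
def utf8_from_utf16_units(units):
    """Encode a sequence of 16-bit code units as UTF-8 bytes.

    Illustration only. Surrogate pairs are combined first; an
    unpaired surrogate falls through to the three-byte branch,
    which is an assumption of this sketch and not valid UTF-8.
    """
    out = bytearray()
    i = 0
    while i < len(units):
        cp = units[i]
        i += 1
        if (0xD800 <= cp <= 0xDBFF and i < len(units)
                and 0xDC00 <= units[i] <= 0xDFFF):
            # Combine high and low surrogate into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)
```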
 
>     If you want to use any other encoding, use the codecs pro-
>     vided by the Unicode subsystem.  If you need to use Unicode
>     characters that cannot be represented as UCS-2, you cannot
>     use Python 2.0's Unicode subsystem.

See above.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/