[Python-Dev] UTF-16 code point comparison

Bill Tutt billtut@microsoft.com
Thu, 27 Jul 2000 07:18:56 -0700


> From: M.-A. Lemburg [mailto:mal@lemburg.com]

> Fredrik Lundh wrote:
> > 
> > mal wrote:
> > > This really has nothing to do with being able to support
> > > surrogates or not (as Fredrik mentioned), it is the correct
> > > behaviour provided UTF-16 is used as encoding for UCS-4 values
> > > in Unicode literals which is what Python currently does.
> > 
> > Really?  I could have sworn that most parts of Python use
> > UCS-2, not UTF-16.

> The design specifies that Py_UNICODE refers to UTF-16. To make
> life easier, the implementation currently assumes UCS-2 in
> many parts, but this should only be considered a
> temporary situation. Since supporting UTF-16 poses some
> real challenges (being a variable length encoding), full
> support for surrogates was postponed to some future
> implementation.

Heh. Now you're being silly. Supporting UTF-16 isn't that difficult. You
always know whether the character is a low surrogate or a high surrogate.
The interesting question is whether or not you have your builtin Unicode
object expose each 16-bit character as is, or you support iterating over
Py_UCS4 characters, or you want to have a wrapping object that does the
right thing here.
The wrapping object might be the way to go.
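
To illustrate why telling the two kinds of surrogate apart is cheap, here is
a minimal sketch in Python. It assumes the string is available as a sequence
of 16-bit integers; the function names are made up for this example and are
not part of any existing API.

```python
def is_high_surrogate(unit):
    # High (leading) surrogates occupy 0xD800..0xDBFF.
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    # Low (trailing) surrogates occupy 0xDC00..0xDFFF.
    return 0xDC00 <= unit <= 0xDFFF


def iter_code_points(units):
    """Yield code points from a sequence of 16-bit code units.

    A high surrogate followed by a low surrogate is combined into one
    code point; an unpaired surrogate is passed through unchanged.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (is_high_surrogate(unit) and i + 1 < n
                and is_low_surrogate(units[i + 1])):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1
```

A wrapping object could expose something like iter_code_points() while the
underlying object keeps exposing the raw 16-bit units.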

> > Built-ins like ord, unichr, len; slicing;
> > string methods; regular expressions, etc. all clearly assume
> > that a Py_UNICODE is a unicode code point.
> > 
> > My point is that we shouldn't pretend we're supporting
> > UTF-16 if we don't do that throughout.

> We should keep that design detail in mind though.
 
> > As far as I can tell, cmp() is the *only* unicode function
> > that thinks the internal storage is UTF-16.
> 
> > Everything else assumes UCS-2.

No, it's UTF-16, it just doesn't yet handle surrogates in all of the
appropriate places. :)
The unicodename stuff also needs to support future named surrogate
characters now.

> > And for Python 2.0, it's surely easier to fix cmp() than to
> > fix everything else.

> Also true :-)

Everything but the regular expressions would be fairly simple to add UTF-16
support to. I'd imagine that adding support for \u10FFFF in the regular
expression syntax wouldn't be that hard either.
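
As a rough illustration of what such support involves, a code point above
0xFFFF has to be turned into a surrogate pair before it can be matched
against 16-bit storage. The sketch below is Python with a hypothetical
helper name; it is not how the regular expression engine does anything
today.

```python
def to_utf16_units(code_point):
    """Return the 16-bit code units that encode one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point < 0x10000:
        # BMP code points are stored as a single unit.
        return [code_point]
    offset = code_point - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 into the low one.
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
```

An escape for a character outside the BMP would then compile to a match on
the two units returned here rather than on a single unit.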

> > (this is the same problem as 8-bit/unicode comparisons, where
> > the current consensus is that it's better to raise an exception
> > if it looks like the programmer doesn't know what he was doing,
> > rather than pretend it's another encoding).

> Perhaps you are right and we should #if 0 the comparison
> sections related to UTF-16 for now. I'm not sure why Bill
> needed the cmp() function to support surrogates... Bill ?

I didn't need it to. I happened upon the code on the IBM website, so I
figured I'd point it out and see what people thought of sticking it into the
Python Unicode stuff. :) (Wishing Python 2.0 would ship with Unicode
collation support) See the earlier comment about creating a wrapping class
that handles UTF-16 issues better. 
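
For anyone who hasn't looked at it, the idea behind a code point order
comparison on UTF-16 can be sketched as follows. This is an illustration in
Python with made-up names, assuming strings are sequences of 16-bit
integers; it is not the IBM code and not the C code in Python.

```python
def fixup(unit):
    """Map a 16-bit code unit to a key that sorts in code point order."""
    if unit >= 0xE000:
        # BMP code points above the surrogate range move down ...
        return unit - 0x800
    if unit >= 0xD800:
        # ... and surrogates move above every BMP code point.
        return unit + 0x2000
    return unit


def compare_code_point_order(a, b):
    """Return -1, 0 or 1, comparing two code unit sequences."""
    for x, y in zip(a, b):
        if x != y:
            # fixup() is one-to-one, so the keys differ as well.
            return -1 if fixup(x) < fixup(y) else 1
    # One sequence is a prefix of the other: the shorter sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))
```

A plain unit-by-unit comparison puts U+FFFF (0xFFFF) after U+10000
(0xD800 0xDC00), because 0xFFFF > 0xD800; the fixup reverses that without
decoding the surrogate pair.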

> Still, it will have to be reenabled sometime in the
> future when full surrogate support is added to Python.
 
Bill