[Python-Dev] UTF-16 code point comparison
Bill Tutt
billtut@microsoft.com
Thu, 27 Jul 2000 07:18:56 -0700
> From: M.-A. Lemburg [mailto:mal@lemburg.com]
> Fredrik Lundh wrote:
> >
> > mal wrote:
> > > This really has nothing to do with being able to support
> > > surrogates or not (as Fredrik mentioned), it is the correct
> > > behaviour provided UTF-16 is used as encoding for UCS-4 values
> > > in Unicode literals which is what Python currently does.
> >
> > Really? I could have sworn that most parts of Python use
> > UCS-2, not UTF-16.
> The design specifies that Py_UNICODE refers to UTF-16. To make
> life easier, the implementation currently assumes UCS-2 in
> many parts, but this should only be considered a
> temporary situation. Since supporting UTF-16 poses some
> real challenges (being a variable length encoding), full
> support for surrogates was postponed to some future
> implementation.
Heh. Now you're being silly. Supporting UTF-16 isn't that difficult. You
always know whether the character is a low surrogate or a high surrogate.
The interesting question is whether or not you have your builtin Unicode
object expose each 16-bit character as is, or you support iterating over
Py_UCS4 characters, or you want to have a wrapping object that does the
right thing here.
A wrapping object might be the way to go.
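To illustrate the point about surrogates being easy to tell apart, here is a
minimal sketch in Python of iterating over full code points given a sequence
of 16-bit code units. The function names and the choice to pass unpaired
surrogates through unchanged are assumptions made for illustration, not
anything in the current implementation.

```python
def is_high_surrogate(unit):
    # High (leading) surrogates occupy 0xD800..0xDBFF.
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    # Low (trailing) surrogates occupy 0xDC00..0xDFFF.
    return 0xDC00 <= unit <= 0xDFFF


def iter_code_points(units):
    """Yield code points from a sequence of 16-bit code units (ints)."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (is_high_surrogate(unit) and i + 1 < n
                and is_low_surrogate(units[i + 1])):
            # Combine a well-formed surrogate pair into one code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character, or an unpaired surrogate passed through as is
            # (an assumption; a stricter version could raise an error).
            yield unit
            i += 1
```

A wrapping object could expose this iterator while the underlying storage
stays a plain array of 16-bit units.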
> > Built-ins like ord, unichr, len; slicing;
> > string methods; regular expressions, etc. all clearly assume
> > that a Py_UNICODE is a unicode code point.
> >
> > My point is that we shouldn't pretend we're supporting
> > UTF-16 if we don't do that throughout.
> We should keep that design detail in mind though.
> > As far as I can tell, cmp() is the *only* unicode function
> > that thinks the internal storage is UTF-16.
>
> > Everything else assumes UCS-2.
No, it's UTF-16, it just doesn't yet handle surrogates in all of the
appropriate places. :)
The unicodename stuff also needs to support future named surrogate
characters now.
> > And for Python 2.0, it's surely easier to fix cmp() than to
> > fix everything else.
> Also true :-)
Everything but the regular expressions would be fairly simple to add UTF-16
support to. I'd imagine that adding support for \u10FFFF in the regular
expression syntax wouldn't be that hard either.
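For reference, a rough sketch of the arithmetic involved in turning a code
point beyond the BMP (such as U+10FFFF) into its UTF-16 surrogate pair. The
function name is hypothetical, and this only shows the encoding step, not how
a regular expression parser would use it.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit code units encoding one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point < 0x10000:
        # BMP characters are stored as a single code unit.
        return [code_point]
    # Supplementary characters: split the 20-bit offset into two halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]
```

With this, U+10FFFF comes out as the pair 0xDBFF 0xDFFF, and U+10000 as
0xD800 0xDC00.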
> > (this is the same problem as 8-bit/unicode comparisons, where
> > the current consensus is that it's better to raise an exception
> > if it looks like the programmer doesn't know what he was doing,
> > rather than pretend it's another encoding).
> Perhaps you are right and we should #if 0 the comparison
> sections related to UTF-16 for now. I'm not sure why Bill
> needed the cmp() function to support surrogates... Bill ?
I didn't need it to. I happened upon the code on the IBM website, so I
figured I'd point it out and see what people thought of sticking it into the
Python Unicode stuff. :) (Wishing Python 2.0 would ship with Unicode
collation support) See the earlier comment about creating a wrapping class
that handles UTF-16 issues better.
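As a sketch of what comparing UTF-16 in code point order involves: plain
code unit comparison puts surrogates (0xD800..0xDFFF) below 0xE000..0xFFFF,
even though the supplementary characters they encode sort above the whole
BMP. One well-known fix-up remaps the units before comparing. The Python
below illustrates that general technique under my own naming; it is not a
copy of the IBM code or of the code in the Python sources.

```python
def fixup(unit):
    """Remap a 16-bit code unit so unit order matches code point order."""
    if unit >= 0xE000:
        # Shift 0xE000..0xFFFF down, below the surrogates.
        return unit - 0x800
    if unit >= 0xD800:
        # Shift surrogates 0xD800..0xDFFF up to the top of the range.
        return unit + 0x2000
    return unit


def compare_code_point_order(a, b):
    """Three-way compare of two code unit sequences; returns -1, 0 or 1."""
    for x, y in zip(a, b):
        if x != y:
            # The first differing unit decides the result after fix-up.
            return -1 if fixup(x) < fixup(y) else 1
    # One sequence is a prefix of the other: the shorter sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))
```

For example, U+FF61 (the single unit 0xFF61) compares greater than U+10000
(0xD800 0xDC00) unit by unit, but less than it with the fix-up applied.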
> Still, it will have to be reenabled sometime in the
> future when full surrogate support is added to Python.
Bill