[I18n-sig] Unicode surrogates: just say no!

Guido van Rossum guido@digicool.com
Wed, 27 Jun 2001 15:53:25 -0400


> Guido van Rossum wrote:
> > 
> >...
> > 
> >   - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U)
> >     raises an exception at Python-to-bytecode compile-time
> 
> unichr(i) is an expression. When would it be evaluated at compile-time?

My mistake.  An out-of-range \U escape would be a compile-time error;
unichr() would of course raise a run-time error.
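
A minimal sketch of that distinction, assuming modern Python 3 (where
chr() stands in for unichr(); the helper names are made up for
illustration).  An out-of-range \U escape is rejected when the source
is compiled, while chr() only fails when it is actually called:

```python
# Python 3 sketch: chr() stands in for the unichr() discussed above.
# literal_compiles() and chr_succeeds() are hypothetical helper names.

def literal_compiles(source):
    """Return True if source compiles; False on a compile-time error."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


def chr_succeeds(i):
    """Return True if chr(i) works; False if it raises at run time."""
    try:
        chr(i)
        return True
    except (ValueError, OverflowError):
        return False


# Raw strings keep the backslash, so the escape is only interpreted
# when compile() parses the string literal inside the source text.
print(literal_compiles(r'"\U0010FFFF"'))  # highest valid escape
print(literal_compiles(r'"\U00110000"'))  # one past the end
print(chr_succeeds(0x10FFFF))
print(chr_succeeds(0x110000))
```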

> Also, I'm not sure what runtime behavior you want for these "very large"
> unichr(i) values.
> 
> In general I don't understand why we're treating the > 0x11000 range
> specially at all?

When using UCS-2 + surrogate pairs (== UTF-16), values >= 0x110000
are not representable, and the Unicode and ISO standards have
effectively declared that range(0x110000) will be the supported range
forever.  (For *some* definition of forever. :-)
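
The limit falls out of the surrogate arithmetic.  Here is a rough
sketch (function names are illustrative only, not from any Python
API): each surrogate carries 10 bits, so a pair covers 20 bits on top
of the 0x10000 base, and the largest pair decodes to 0x10FFFF.

```python
# Sketch of the UTF-16 surrogate arithmetic; function names are
# illustrative, not taken from any Python API.

def to_surrogate_pair(cp):
    """Split a code point in 0x10000..0x10FFFF into (high, low)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not representable as a surrogate pair: %#x" % cp)
    offset = cp - 0x10000            # fits in 20 bits
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair decodes to 0x10FFFF, so nothing at or
# above 0x110000 can be expressed this way.
print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))
```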

When using UCS-4 mode, I was in favor of allowing unichr() and \U to
specify any value in range(0x100000000L), but that's not what Martin
and Fredrik checked in.  Note that if C code somehow creates a UCS-4
string containing something with the high bit on, ord() will currently
return a negative value on platforms where a C long is 32 bits.
Returning a Python long int with a positive value would be more
consistent, but since these values aren't useful, I wonder if we
should care.  On the other hand, do we want ord() to raise an error
when the value is not a legal Unicode code point?  (Fortunately lone
surrogates are still legal code points -- AFAIK all values in
range(0x110000) are legal code points.)
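
To illustrate the sign problem, here is a small sketch that mimics
reading a 32-bit storage unit through a signed 32-bit C long.
as_signed_32() is a hypothetical helper, not anything in the
interpreter.  The last lines assume modern Python 3, where chr()
replaces unichr(), and show a lone surrogate round-tripping through
chr() and ord().

```python
# as_signed_32() is a hypothetical helper that reinterprets an
# unsigned 32-bit value the way a signed 32-bit C long would.

def as_signed_32(value):
    """Reinterpret an unsigned 32-bit value as two's-complement signed."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    if value & 0x80000000:           # high bit on
        return value - 0x100000000
    return value


print(as_signed_32(0x0010FFFF))  # high bit off: value is unchanged
print(as_signed_32(0x80000000))  # high bit on: comes out negative

# Assuming Python 3: a lone surrogate is accepted as a code point.
lone = chr(0xD800)
print(hex(ord(lone)))
```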

Definitely a PEP question; it's not cast in stone.

--Guido van Rossum (home page: http://www.python.org/~guido/)