[I18n-sig] Unicode surrogates: just say no!
Guido van Rossum
guido@digicool.com
Wed, 27 Jun 2001 15:53:25 -0400
> Guido van Rossum wrote:
> >
> >...
> >
> > - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U)
> > raises an exception at Python-to-bytecode compile-time
>
> unichr(i) is an expression. When would it be evaluated at compile-time?
My mistake. The corresponding \U would be a compile-time error,
unichr() of course a run-time error.
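To make the compile-time vs. run-time distinction concrete, here is a minimal sketch. It assumes present-day Python 3, where chr() has taken over the role of unichr(); the helper names are hypothetical and only for illustration.

```python
# Sketch assuming Python 3, where chr() plays the role of unichr().
# Helper names are hypothetical, for illustration only.

def escape_fails_at_compile_time(source):
    """Return True if merely compiling `source` raises SyntaxError."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return True
    return False


def chr_fails_at_run_time(i):
    """Return True if chr(i) compiles but raises ValueError when executed."""
    code = compile("chr(%d)" % i, "<example>", "eval")  # compiling succeeds
    try:
        eval(code)
    except ValueError:
        return True
    return False


# The out-of-range \U escape is rejected before any code runs ...
print(escape_fails_at_compile_time(r"'\U00110000'"))
# ... while the chr() call compiles and only fails when it is executed.
print(chr_fails_at_run_time(0x110000))
```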
> Also, I'm not sure what runtime behavior you want for these "very large"
> unichr(i) values.
>
> In general I don't understand why we're treating the >= 0x110000 range
> specially at all?
When using UCS-2 + surrogate pairs (== UTF-16), values >= 0x110000 are
not representable, and the Unicode and ISO standards have effectively
declared that range(0x110000) will be the supported range forever.
(For *some* definition of forever. :-)
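The limit follows from the surrogate-pair arithmetic: each half of a pair carries 10 bits, so a pair addresses 2**20 values on top of the 0x10000 that fit in a single 16-bit unit. A small worked sketch (function names are mine, not from any library):

```python
# Worked sketch of UTF-16 surrogate-pair arithmetic.
# Function names are hypothetical, for illustration only.

def to_surrogate_pair(cp):
    """Split a code point in 0x10000..0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point needs no pair, or is out of range")
    offset = cp - 0x10000                # fits in 20 bits
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair decodes to the last representable code point.
print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))
```

Anything at or above 0x110000 would need an offset wider than 20 bits, which no pair can carry.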
When using UCS-4 mode, I was in favor of allowing unichr() and \U to
specify any value in range(0x100000000L), but that's not what Martin
and Fredrik checked in. Note that if C code somehow creates a UCS-4
string containing something with the high bit on, ord() will currently
return a negative value on platforms where a C long is 32 bits.
Returning a Python long int with a positive value would be more
consistent, but since these values aren't useful, I wonder if we
should care. On the other hand, do we want ord() to raise an error
when the value is not a legal Unicode code point? (Fortunately lone
surrogates are still legal code points -- AFAIK all values in
range(0x110000) are legal code points.)
Definitely a PEP question; it's not cast in stone.
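As a rough illustration of the negative ord() case described above, the sketch below reinterprets an unsigned 32-bit value as a signed one, which is the effect of passing it through a 32-bit C long. It uses the struct module purely to mimic that conversion and assumes Python 3 for the last line; the helper name is hypothetical.

```python
import struct

# Hypothetical helper: mimics what happens when an unsigned 32-bit value
# is read back through a signed 32-bit C long.
def as_signed_32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    (signed,) = struct.unpack("<i", struct.pack("<I", value))
    return signed


print(as_signed_32(0x80000000))  # high bit set: comes back negative
print(as_signed_32(0x0010FFFF))  # largest Unicode code point: unchanged

# In Python 3, a lone surrogate is still accepted as a code point.
print(ord(chr(0xD800)) == 0xD800)
```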
--Guido van Rossum (home page: http://www.python.org/~guido/)