[I18n-sig] Unicode surrogates: just say no!

Guido van Rossum guido@digicool.com
Tue, 26 Jun 2001 19:47:05 -0400


> Aren't we trying to get rid of the maximum int size? And even if we keep it,
> the rule for working with large integers is simple: calculations work on
> particular ranges of inputs. Period. 

Well... 0xffffffff is negative on 32-bit systems but positive on
64-bit systems, and there are other anomalies like it.
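
To make the anomaly concrete, here is a minimal sketch in present-day
Python, where integers are unbounded, so the fixed word size has to be
simulated. The helper name as_signed is hypothetical and only serves
the illustration; it reinterprets a bit pattern as a two's-complement
integer of a given width.

```python
def as_signed(value, bits):
    """Reinterpret the low `bits` bits of value as a two's-complement integer.

    Hypothetical helper: it only simulates how a fixed-width C long
    would read the same bit pattern.
    """
    value &= (1 << bits) - 1          # keep only the low `bits` bits
    if value >= 1 << (bits - 1):      # sign bit set: wrap to negative
        value -= 1 << bits
    return value


print(as_signed(0xffffffff, 32))  # -1
print(as_signed(0xffffffff, 64))  # 4294967295
```

The same literal yields a different value depending only on the word
size, which is the kind of platform dependence being compared to the
2-byte versus 4-byte Unicode choice.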

It's not ideal, but given the forces at work (some folks need UCS-4,
some folks don't want to waste 2 extra bytes per character, we don't
want to revise the implementation to hide the existence of surrogates
in the 2-byte version) I think it's the best we can offer.

> If I understand correctly, the surrogates proposal will (for example)
> change this from legal to illegal:
> 
> if unichr(0x10000) in somestring:
> 	...
> 
> Because sometimes unichr returns a single-char string and sometimes it will
> actually produce a two-character surrogate pair.

Yes, good example for the PEP. :-)
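
As a rough illustration of why the result stops being a single
character, the sketch below computes the UTF-16 surrogate pair for a
code point above 0xFFFF. It is written in present-day Python and works
on plain integers rather than on a real 2-byte build; the names
to_surrogate_pair and narrow_units are hypothetical and are not part
of any Python API.

```python
def to_surrogate_pair(code_point):
    """Split a code point in 0x10000..0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the surrogate-encoded range")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low


def narrow_units(code_point):
    """Hypothetical model of the 16-bit units a 2-byte build would store."""
    if code_point < 0x10000:
        return [code_point]
    return list(to_surrogate_pair(code_point))


print([hex(u) for u in narrow_units(0x41)])     # ['0x41']
print([hex(u) for u in narrow_units(0x10000)])  # ['0xd800', '0xdc00']
```

In this model the value for 0x10000 has length 2, so code that assumes
unichr always yields exactly one character no longer behaves the same
way on both kinds of build.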

> > Do you want to write the PEP?
> 
> If nobody pipes up to say that they've started it, then I'll do a first
> draft tonight. I presume you mean write the PEP up as you described it
> and not as I would like it.

Great, Paul!  I'm tired of writing PEPs myself today.

> So I guess I would want to cover
> 
>  * what is the issue
>  * what are surrogates
>  * how Py_UNICODE affects literals and unichr
>  * rationale for doing surrogate generation
>  * description of the configure switches
>  * description of why other options were rejected

Yes.  You can quote liberally from the i18n list.

Use PEP number 261.  Thanks so much!

--Guido van Rossum (home page: http://www.python.org/~guido/)