[I18n-sig] Unicode surrogates: just say no!
Guido van Rossum
guido@digicool.com
Tue, 26 Jun 2001 19:47:05 -0400
> Aren't we trying to get rid of the maximum int size? And even if we keep it,
> the rule for working with large integers is simple: calculations work on
> particular ranges of inputs. Period.
Well... 0xffffffff is negative on 32-bit systems but positive on
64-bit systems, and there are other anomalies like it.
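
The anomaly above can be sketched as follows. Integers in current Python
are unbounded, so the literal no longer changes sign with the platform;
the helper below (as_signed, a hypothetical name used only for this
illustration) simulates reading the same bit pattern as a two's-complement
integer of a given word size.

```python
def as_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer.

    Hypothetical helper for illustration; it only simulates a fixed word size.
    """
    mask = (1 << bits) - 1
    value &= mask
    # If the top bit of the word is set, the value is negative.
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


print(as_signed(0xffffffff, 32))  # -1
print(as_signed(0xffffffff, 64))  # 4294967295
```

The same literal fills every bit of a 32-bit word, including the sign bit,
but only the low half of a 64-bit word, which is why the sign differs.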
It's not ideal, but given the forces at work (some folks need UCS-4,
some folks don't want to waste 2 extra bytes per character, we don't
want to revise the implementation to hide the existence of surrogates
in the 2-byte version) I think it's the best we can offer.
> If I understand correctly, the surrogates proposal will (for example)
> change this from legal to illegal:
>
>     if unichr(0x10000) in somestring:
>         ...
>
> Because sometimes unichr returns a single-character string and sometimes
> it will actually produce a two-character surrogate pair.
Yes, good example for the PEP. :-)
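
For readers unfamiliar with surrogates, here is a minimal sketch of how a
code point above U+FFFF maps to two 16-bit code units, which is what a
2-byte build would have to produce for unichr(0x10000). The function name
to_surrogate_pair is an assumption made for this illustration, not part of
any proposal in this thread; the arithmetic is the standard UTF-16 rule.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the 16-bit range")
    offset = code_point - 0x10000       # 20 bits of payload
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))  # 0xd800 0xdc00

# Cross-check against the built-in UTF-16 codec (big-endian, no BOM).
assert chr(0x10000).encode("utf-16-be") == bytes([0xD8, 0x00, 0xDC, 0x00])
```

With 16-bit storage the result occupies two code units, so a membership
test written for a single character no longer means what it appears to.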
> > Do you want to write the PEP?
>
> If nobody pipes up to say that they've started it, then I'll do a first
> draft tonight. I presume you mean write the PEP up as you described it
> and not as I would like it.
Great, Paul! I'm tired of writing PEPs myself today.
> So I guess I would want to cover
>
> * what is the issue
> * what are surrogates
> * how Py_UNICODE affects literals and unichr
> * rationale for doing surrogate generation
> * description of the configure switches
> * description of why other options were rejected
Yes. You can quote liberally from the i18n list.
Use PEP number 261. Thanks so much!
--Guido van Rossum (home page: http://www.python.org/~guido/)