[I18n-sig] How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Mon, 25 Jun 2001 09:42:01 -0400


> This would mean 4 bytes per Unicode character and is
> unacceptable given the fact that most of these would be 0-bytes

Agreed, but see below.

> in practice. It would also break binary compatibility to the
> native Unicode wchar_t type on e.g. Windows, which is among
> the most Unicode-aware platforms there are today.

Shouldn't there be a conversion routine between wchar_t[] and
Py_UNICODE[] instead of assuming they have the same format?  This will
come up more often, and Linux has sizeof(wchar_t) == 4, I believe.
(Which suggests that others disagree on the waste of space.)
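
A minimal sketch of what such a conversion routine has to do, written
in Python for brevity rather than C. It assumes the 4-byte side holds
one code point per element and the 2-byte side holds UTF-16 code units
(surrogate pairs for anything above U+FFFF). The function names are
hypothetical and not part of any Python API, and the sketch does no
validation: unpaired surrogates are passed through unchanged.

```python
def ucs4_to_utf16_units(code_points):
    """Convert code points (4-byte elements) to 16-bit code units."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            # Fits in a single 16-bit unit.
            units.append(cp)
        else:
            # Split the 20 bits above 0x10000 into a surrogate pair.
            offset = cp - 0x10000
            units.append(0xD800 | (offset >> 10))    # high surrogate
            units.append(0xDC00 | (offset & 0x3FF))  # low surrogate
    return units


def utf16_units_to_ucs4(units):
    """Convert 16-bit code units back to code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Recombine a well-formed surrogate pair.
            low = units[i + 1]
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            )
            i += 2
        else:
            # BMP character, or an unpaired surrogate left as-is.
            code_points.append(unit)
            i += 1
    return code_points
```

For example, U+1D11E would become the pair 0xD834, 0xDD1E on the
2-byte side and be recombined on the way back.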

> > > BTW, Python's Unicode implementation is bound to the standard
> > > defined at www.unicode.org; moving over to ISO 10646 is not an
> > > option.
> > 
> > Can you elaborate? How can you rule out that option that easily?
> 
> It is not an option because we chose Unicode as our basis for
> i18n work and not the ISO 10646 Universal Character Set. I'd rather
> have those two camps fight over the details of the Unicode standard
> than try to fix anything related to the differences between the two
> in Python by mixing them.

Agreed.  But be prepared that at some point in the future the Unicode
world might end up agreeing on 4 bytes too...

> > And why can't Python support the two standards simultaneously?
> 
> Why would you want to support two standards for the same thing?

Well, we support ASCII and Unicode. :-)

If ISO 10646 becomes important to our users, we'll have to support
it, if only by providing a codec.
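
As a rough illustration of the codec route, assuming a later Python 3
interpreter whose standard library includes the "utf-32-be" codec
(a 4-bytes-per-character encoding; it did not exist when this was
written), the round trip and the space cost look roughly like this.
The sample string is arbitrary.

```python
text = "na\u00efve"  # five characters, all below U+0100

# Encode to a fixed-width, 4-bytes-per-character form.
ucs4 = text.encode("utf-32-be")
bytes_per_char = len(ucs4) // len(text)

# Count how many of those bytes are zero padding.
zero_bytes = ucs4.count(0)

# Decoding restores the original string.
round_trip = ucs4.decode("utf-32-be")
```

For Latin-range text like this, three of every four bytes are zero,
which is the space concern raised at the top of the thread.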

--Guido van Rossum (home page: http://www.python.org/~guido/)