[I18n-sig] How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Jun 2001 17:58:49 +0200


Guido van Rossum wrote:
> 
> > This would mean 4 bytes per Unicode character and is
> > unacceptable given the fact that most of these would be 0-bytes
> 
> Agreed, but see below.
> 
> > in practice. It would also break binary compatibility with the
> > native Unicode wchar_t type on e.g. Windows, which is among
> > the most Unicode-aware platforms there are today.
> 
> Shouldn't there be a conversion routine between wchar_t[] and
> Py_UNICODE[] instead of assuming they have the same format?  This will
> come up more often, and Linux has sizeof(wchar_t) == 4 I believe.
> (Which suggests that others disagree on the waste of space.)

There are conversion routines in Python which map between Py_UNICODE
and wchar_t; these take advantage of the fact that on e.g. Windows
Py_UNICODE can use wchar_t as its basis, which makes the conversion
very fast.

On Linux (which uses 4 bytes per wchar_t) the routine inserts
tons of zeros to make Tux happy :-)
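
To give an idea of what that means, here is a rough sketch of the
widening step (not the actual CPython code -- the real work is done
behind PyUnicode_FromWideChar() and PyUnicode_AsWideChar(); the
helper name below is made up):

    #include <stddef.h>
    #include <wchar.h>

    typedef unsigned short Py_UNICODE;  /* 2-byte narrow build */

    /* Zero-extend each 16-bit Py_UNICODE value into a 4-byte
       wchar_t slot -- these are the zeros mentioned above. */
    static void
    widen_py_unicode(const Py_UNICODE *src, wchar_t *dst, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            dst[i] = (wchar_t)src[i];
    }

On Windows, where sizeof(wchar_t) == 2 matches Py_UNICODE, the same
conversion can collapse into a plain memcpy() -- hence the speed.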
 
> > > > BTW, Python's Unicode implementation is bound to the standard
> > > > defined at www.unicode.org; moving over to ISO 10646 is not an
> > > > option.
> > >
> > > Can you elaborate? How can you rule out that option that easily?
> >
> > It is not an option because we chose Unicode as our basis for
> > i18n work and not the ISO 10646 Universal Character Set. I'd rather
> > have those two camps fight over the details of the Unicode standard
> > than try to fix anything related to the differences between the two
> > in Python by mixing them.
> 
> Agreed.  But be prepared that at some point in the future the Unicode
> world might end up agreeing on 4 bytes too...

No problem... we can change to 4-byte values too if the world
agrees on 4 bytes per character. However, whether to use 2 or 4
bytes is an implementation detail, not part of the Unicode standard
itself.

4 bytes per character makes things at the C level much easier,
which is probably why the GNU C lib team chose 4 bytes. Other
programming languages like Java and platforms like Windows chose
2-byte UTF-16 as their internal format. I guess it's up to user
acceptance to choose between the two: 2 bytes means more work for
the implementor, 4 bytes means more $$$ for Micron et al. ;-)
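
The extra work with 2-byte units comes from surrogate pairs: any
code point above U+FFFF has to be split into two units. A sketch
of the standard Unicode arithmetic (the function name is mine, not
taken from any particular implementation):

    #include <assert.h>

    typedef unsigned short utf16_unit;  /* one 16-bit code unit */
    typedef unsigned long  code_point;  /* holds up to U+10FFFF */

    /* Encode one code point as UTF-16; returns the number of
       units written (1 for the BMP, 2 for a surrogate pair). */
    static int
    utf16_encode(code_point ch, utf16_unit out[2])
    {
        if (ch < 0x10000) {
            out[0] = (utf16_unit)ch;
            return 1;
        }
        assert(ch <= 0x10FFFF);
        ch -= 0x10000;                                 /* 20 bits left */
        out[0] = (utf16_unit)(0xD800 + (ch >> 10));    /* high half */
        out[1] = (utf16_unit)(0xDC00 + (ch & 0x3FF));  /* low half */
        return 2;
    }

With 4-byte units every code point is a single unit, so indexing
stays trivial -- that's the simplicity the glibc folks bought with
the extra memory.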

> > > And why can't Python support the two standards simultaneously?
> >
> > Why would you want to support two standards for the same thing?
> 
> Well, we support ASCII and Unicode. :-)
> 
> If ISO 10646 becomes important to our users, we'll have to support
> it, if only by providing a codec.

This is different: ISO 10646 is a competing standard, not just a 
different encoding.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/