[I18n-sig] How does Python Unicode treat surrogates?
M.-A. Lemburg
mal@lemburg.com
Mon, 25 Jun 2001 17:58:49 +0200
Guido van Rossum wrote:
>
> > This would mean 4 bytes per Unicode character and is
> > unacceptable given the fact that most of these would be 0-bytes
>
> Agreed, but see below.
>
> > in practice. It would also break binary compatibility with the
> > native Unicode wchar_t type on e.g. Windows, which is among
> > the most Unicode-aware platforms there are today.
>
> Shouldn't there be a conversion routine between wchar_t[] and
> Py_UNICODE[] instead of assuming they have the same format? This will
> come up more often, and Linux has sizeof(wchar_t) == 4 I believe.
> (Which suggests that others disagree on the waste of space.)
There are conversion routines in Python which map between Py_UNICODE
and wchar_t, and they exploit the fact that e.g. on Windows
Py_UNICODE can use wchar_t as its basis, which makes the
conversion very fast.
On Linux (which uses 4 bytes per wchar_t) the routine inserts
tons of zeros to make Tux happy :-)
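The zero-padding described here is easy to see from Python itself. A small sketch (using the codec names of current Python; the byte counts assume pure-ASCII text):

```python
# The same text stored in 2-byte units (UTF-16, as on Windows) vs.
# 4-byte units (UTF-32, matching a 4-byte wchar_t on Linux).
text = "Python"
utf16 = text.encode("utf-16-le")  # 2 bytes per BMP character
utf32 = text.encode("utf-32-le")  # 4 bytes per character
print(len(utf16))        # 12
print(len(utf32))        # 24
# For ASCII text, the extra width is all zero bytes -- three of the
# four bytes of every UTF-32 unit here are zero:
print(utf32.count(0))    # 18
```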
> > > > BTW, Python's Unicode implementation is bound to the standard
> > > > defined at www.unicode.org; moving over to ISO 10646 is not an
> > > > option.
> > >
> > > Can you elaborate? How can you rule out that option that easily?
> >
> > It is not an option because we chose Unicode as our basis for
> > i18n work and not the ISO 10646 Uniform Character Set. I'd rather
> > have those two camps fight over the details of the Unicode standard
> > than try to fix anything related to the differences between the two
> > in Python by mixing them.
>
> Agreed. But be prepared that at some point in the future the Unicode
> world might end up agreeing on 4 bytes too...
No problem... we can change to 4 byte values too if the world
agrees on 4 bytes per character. However, 2 bytes or 4 bytes
is an implementation detail and not part of the Unicode standard
itself.
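That the code-unit width is an implementation detail is visible at the Python level: the interpreter only exposes the largest representable code point. (On the narrow 2-byte builds contemporary with this message this reported 0xFFFF; since PEP 393 in Python 3.3 it is always 0x10FFFF.)

```python
import sys

# Largest code point the build can represent -- the width of the
# internal representation itself is not part of the Unicode standard.
print(hex(sys.maxunicode))  # 0x10ffff on any Python >= 3.3
```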
4 bytes per character makes things at the C level much easier,
which is probably why the GNU C lib team chose 4 bytes. Other
programming languages like Java and platforms like Windows
chose UTF-16 with 2-byte code units as their internal format.
I guess it's up to user acceptance to choose between the two:
2 bytes means more work for the implementor, 4 bytes means
more $$$ for Micron et al. ;-)
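The extra work the 2-byte format imposes is the surrogate-pair mechanism the thread's subject asks about: code points above U+FFFF are split into two 16-bit units. A minimal sketch of that scheme (the helper name is mine, for illustration):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    assert cp > 0xFFFF
    v = cp - 0x10000               # 20-bit offset beyond the BMP
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

hi, lo = to_surrogate_pair(0x10000)  # first code point beyond the BMP
print(hex(hi), hex(lo))              # 0xd800 0xdc00

# Cross-check against Python's own UTF-16 codec:
data = "\U00010000".encode("utf-16-be")
assert (hi, lo) == (int.from_bytes(data[:2], "big"),
                    int.from_bytes(data[2:], "big"))
```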
> > > And why can't Python support the two standards simultaneously?
> >
> > Why would you want to support two standards for the same thing ?
>
> Well, we support ASCII and Unicode. :-)
>
> If ISO 10646 becomes important to our users, we'll have to support
> it, if only by providing a codec.
This is different: ISO 10646 is a competing standard, not just a
different encoding.
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/