[I18n-sig] How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Jun 2001 19:01:28 +0200


Guido van Rossum wrote:
> 
> > > Shouldn't there be a conversion routine between wchar_t[] and
> > > Py_UNICODE[] instead of assuming they have the same format?  This will
> > > come up more often, and Linux has sizeof(wchar_t) == 4 I believe.
> > > (Which suggests that others disagree on the waste of space.)
> >
> > There are conversion routines which map between Py_UNICODE
> > and wchar_t in Python and these make use of the fact that
> > e.g. on Windows Py_UNICODE can use wchar_t as basis which makes
> > the conversion very fast.
> >
> > On Linux (which uses 4 bytes per wchar_t) the routine inserts
> > tons of zeros to make Tux happy :-)
> 
> Maybe this code should be restructured so that it lengthens the
> characters or not depending on the size difference between Py_UNICODE
> and wchar_t, rather than making platform assumptions.

This is how it currently works.
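
As a rough illustration of a size-driven conversion (a Python sketch only, not the actual C routine; the name convert_units and its parameters are made up for this example), the code unit widths are passed in and the same-width case degenerates to a plain copy:

```python
import struct

# Hypothetical helper for illustration; not part of the Python C API.
_FORMATS = {2: "H", 4: "I"}  # struct codes for 2- and 4-byte unsigned units


def convert_units(data: bytes, src_size: int, dst_size: int) -> bytes:
    """Re-pack little-endian fixed-size code units from src_size to dst_size bytes."""
    if src_size == dst_size:
        return data  # same width: nothing to do but copy
    count = len(data) // src_size
    # "<" selects little-endian with standard sizes, independent of platform.
    units = struct.unpack(f"<{count}{_FORMATS[src_size]}", data)
    # Widening pads each unit with zero bytes; narrowing raises struct.error
    # if a value does not fit into the smaller unit.
    return struct.pack(f"<{count}{_FORMATS[dst_size]}", *units)


widened = convert_units("Tux".encode("utf-16-le"), 2, 4)
```

Note that this only changes the width of each code unit. It does not combine surrogate pairs, which is a separate issue.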
 
> If this is the only thing that keeps us from having a configuration
> OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it.

This is not easy to fix and can certainly not be made an
option: UTF-16 has surrogates and is a variable width encoding
of Unicode while UCS-4 is a fixed width encoding.
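
To make the difference concrete, here is a small Python sketch using the standard utf-16-le and utf-32-le codecs. The sample string is an arbitrary choice: one BMP character followed by one character outside the BMP.

```python
# "A" is in the BMP; U+10400 lies outside it and needs a surrogate pair in UTF-16.
text = "A\U00010400"

utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

units16 = len(utf16) // 2  # number of 16-bit code units
units32 = len(utf32) // 4  # number of 32-bit code units

# Split the UTF-16 bytes into individual 16-bit code units for inspection.
code_units = [int.from_bytes(utf16[i:i + 2], "little") for i in range(0, len(utf16), 2)]
```

The two characters take three 16-bit units in UTF-16 (the second character becomes a high and a low surrogate) but exactly two 32-bit units in UTF-32, so indexing by code unit only matches indexing by character in the fixed width case.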

Python currently only has minimal support for surrogates, so
purists would say that we support UCS-2. However, we deliberately
chose this path to be able to upgrade to UTF-16 at some later
point in time and it seems that this time has now come.

> > > Agreed.  But be prepared that at some point in the future the Unicode
> > > world might end up agreeing on 4 bytes too...
> >
> > No problem... we can change to 4 byte values too if the world
> > agrees on 4 bytes per character. However, 2 bytes or 4 bytes
> > is an implementation detail and not part of the Unicode standard
> > itself.
> 
> But UTF-16 vs. UCS-4 is not an implementation detail!

True.
 
> If we store 4 bytes per character, we should treat surrogates
> differently.  I don't know where those would be converted -- probably
> in the UTF-16 to UCS-4 codec.
> 
> I'd be happy to make the configuration choice between UTF-16 and
> UCS-4, if that's doable.

Not easily, I'm afraid.
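
For reference, the surrogate handling a UTF-16 to UCS-4 conversion would need looks roughly like the following Python sketch. The function name is hypothetical, and passing lone surrogates through unchanged is an assumption of this example, not a statement about what a real codec should do.

```python
def utf16_units_to_code_points(units):
    """Combine surrogate pairs in a sequence of 16-bit code units into code points."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Each surrogate carries 10 bits of the offset above U+10000.
            result.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code unit, or a lone surrogate passed through unchanged (assumption).
            result.append(unit)
            i += 1
    return result


code_points = utf16_units_to_code_points([0x41, 0xD801, 0xDC00])
```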
 
> > 4 bytes per character makes things at the C level much easier
> > and this is probably why the GNU C lib team chose 4 bytes. Other
> > programming languages like Java and platforms like Windows
> > chose 2-byte UTF-16 as internal format. I guess it's up to the
> > user acceptance to choose between the two. 2 bytes means more
> > work on the implementor, 4 bytes means more $$$ for Micron et al. ;-)
> 
> My 1-year old laptop has a 10 GB hard drive and 128 MB RAM.  Current
> machines are between 2-4 times that.  How much of that space will be
> wasted on extra Unicode?  For a typical user, most of it is MP3's
> anyway. :-)

True again :-) Still, it's the main argument people have against
using 4 bytes per character; here's a quote from Mark Davis,
the Unicode Consortium President:

http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/
"""
Decisions, decisions...
  Ultimately, the choice of which encoding format to use will depend
  heavily on the programming environment. For systems that only offer
  8-bit strings currently, but are multi-byte enabled, UTF-8 may be the
  best choice. For systems that do not care about storage requirements,
  UTF-32 may be best. For systems such as Windows, Java, or ICU that use
  UTF-16 strings already, UTF-16 is the obvious choice. Even if they
  have not yet upgraded to fully support surrogates, they will be before
  long.

  If the programming environment is not an issue, UTF-16 is recommended
  as a good compromise between elegance, performance, and storage.
"""
 
> > > > > And why can't Python support the two standards simultaneously?
> > > >
> > > > Why would you want to support two standards for the same thing ?
> > >
> > > Well, we support ASCII and Unicode. :-)
> > >
> > > If ISO 10646 becomes important to our users, we'll have to support
> > > it, if only by providing a codec.
> >
> > This is different: ISO 10646 is a competing standard, not just a
> > different encoding.
> 
> Oh.  I didn't know.  How does it differ from Unicode?  What's the user
> acceptance?

http://www.unicode.org/unicode/consortium/memblogo.html says it all.

ISO 10646 documents are only available on a pay-per-page basis --
not really ideal for spreading the word...
(http://wwwold.dkuug.dk/JTC1/SC2/WG2/)
 
-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/