[I18n-sig] How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Jun 2001 20:35:12 +0200


Guido van Rossum wrote:
> 
> OK, focusing on a single item.
> 
> [me]
> > > If this is the only thing that keeps us from having a configuration
> > > OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it.
> 
> [MAL]
> > This is not easy to fix and can certainly not be made an
> > option: UTF-16 has surrogates and is a variable width encoding
> > of Unicode while UCS-4 is a fixed width encoding.
> 
> But even if we supported UTF-16 with surrogates, picking strings apart
> using u[i] would still be able to access the separate lower and upper
> halves of the surrogates, right, and in the presence of surrogates
> len(u) would not match the number of *characters* in u.

That's because len(u) has nothing to do with the number of 
characters in the string; it only counts the code units (Py_UNICODEs)
which are used to represent characters. The same is true for normal
strings, e.g. UTF-8 can use between 1 and 4 code units (bytes in this
case) for a single code point, and in Unicode you can create characters
by combining several code points.
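
To make the code unit vs. code point distinction concrete, here is a
minimal sketch. It assumes a modern Python 3 interpreter, where str is
indexed by code point, so the UTF-16 code units are built explicitly;
utf16_code_units is a hypothetical helper, not an existing API.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    # Every UTF-16 code unit is two bytes wide.
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

text = "a\U00010000"                    # 'a' followed by U+10000 (outside the BMP)
units = utf16_code_units(text)          # 'a' needs one unit, U+10000 needs two
utf8_len = len(text.encode("utf-8"))    # 'a' needs one byte, U+10000 needs four

print(len(text), len(units), utf8_len)
```

The three numbers differ because each one counts a different thing: code
points, 16-bit code units, and 8-bit code units.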

As Mark Davis pointed out:

"""In most people's experience, it is best to leave the low level interfaces
with indices in terms of code units, then supply some utility routines that
tell you information about code points. The most useful are:

- given a string and an index into that string, how many code points are
  before it?
- given a string and a number of code points, what is the lowest index that
  contains them?
- given a string and an index into that string, is the index on a code point
  boundary?
"""
 
Python could use some more Unicode methods to answer these
questions.
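
A rough sketch of what such utilities could look like, operating on a
plain list of 16-bit code units. All function names here are
hypothetical and chosen for illustration only; none of them is an
existing Python method.

```python
def _is_high(u):
    return 0xD800 <= u <= 0xDBFF

def _is_low(u):
    return 0xDC00 <= u <= 0xDFFF

def _pair_at(units, i):
    """True if a well-formed surrogate pair starts at code-unit index i."""
    return i + 1 < len(units) and _is_high(units[i]) and _is_low(units[i + 1])

def count_code_points_before(units, index):
    """Number of code points that start before code-unit index."""
    count = 0
    i = 0
    while i < index:
        i += 2 if _pair_at(units, i) else 1
        count += 1
    return count

def index_after_code_points(units, n):
    """Lowest code-unit index that has n code points before it."""
    i = 0
    for _ in range(n):
        if i >= len(units):
            raise IndexError("string holds fewer than n code points")
        i += 2 if _pair_at(units, i) else 1
    return i

def is_code_point_boundary(units, index):
    """False only if index falls between the halves of a surrogate pair."""
    if index <= 0 or index >= len(units):
        return True
    return not (_is_high(units[index - 1]) and _is_low(units[index]))

# 'a', U+10000 (as a surrogate pair), 'b'
sample = [0x0061, 0xD800, 0xDC00, 0x0062]
```

Note that all three routines keep indices in terms of code units, as
suggested in the quote; only the answers are expressed in code points.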

> > Python currently only has minimal support for surrogates, so
> > purists would say that we support UCS-2. However, we deliberately
> > chose this path to be able to upgrade to UTF-16 at some later
> > point in time and it seems that this time has now come.
> 
> How hard would it be to also change the party line about what the
> encoding used is based on whether we use 2 or 4 bytes?  We could even
> give three choices: UCS-2 (current situation, no surrogates), UTF-16
> (16-bit items with some surrogate support) or UCS-4 (32-bit items)?

Ehm... what are you getting at here ?
 
> > > I'd be happy to make the configuration choice between UTF-16 and
> > > UCS-4, if that's doable.
> >
> > Not easily, I'm afraid.
> 
> Can you explain why this is not easy?

Because choosing whether or not to support surrogates is a 
fundamental choice which affects far more than just the way you
access storage. Surrogates introduce variable width characters:
some characters use two Py_UNICODE code units (a surrogate pair) while (most)
others only use one.

Remember when we discussed which internal format to use or
which default encoding to apply ? We ruled out UTF-8 because
it fails badly when it comes to slicing, concatenation, indexing,
etc. 

UTF-16 is much less painful as most code points only take
up a single code unit, but it still introduces a break in concept.
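
The break in concept can be illustrated with a small sketch: slicing a
sequence of UTF-16 code units at an arbitrary index may cut a surrogate
pair in half. This assumes Python 3 and its strict "utf-16-le" codec,
which rejects lone surrogates; decode_units and is_well_formed are
hypothetical helpers.

```python
def decode_units(units):
    """Decode a list of 16-bit code units as UTF-16 (hypothetical helper)."""
    data = b"".join(u.to_bytes(2, "little") for u in units)
    return data.decode("utf-16-le")

def is_well_formed(units):
    """True if the code units form valid UTF-16, i.e. no lone surrogates."""
    try:
        decode_units(units)
        return True
    except UnicodeDecodeError:
        return False

# 'a', U+10000 (as a surrogate pair), 'b'
units = [0x0061, 0xD800, 0xDC00, 0x0062]

# Slicing at code-unit index 2 separates the two halves of the pair.
left, right = units[:2], units[2:]

print(is_well_formed(units), is_well_formed(left), is_well_formed(right))
```

Concatenating the two halves again restores a well-formed sequence, so
the damage only shows up when a half is used on its own.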

> > http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/
> > """
> > Decisions, decisions...
> >   Ultimately, the choice of which encoding format to use will depend
> >   heavily on the programming environment. For systems that only offer
> >   8-bit strings currently, but are multi-byte enabled, UTF-8 may be
> >   the best choice. For systems that do not care about storage
> >   requirements, UTF-32 may be best. For systems such as Windows, Java,
> >   or ICU that use UTF-16 strings already, UTF-16 is the obvious
> >   choice. Even if they have not yet upgraded to fully support
> >   surrogates, they will be before long.
> >
> >   If the programming environment is not an issue, UTF-16 is
> >   recommended as a good compromise between elegance, performance, and
> >   storage.
> > """
> 
> I buy that as an argument for supporting UTF-16, but not for cutting
> off the road to supporting UCS-4 for those users who would like to opt
> in.

That was not my point. I just wanted to point out how well UTF-16
is being accepted out there and that we are in good company by
moving from UCS-2 to UTF-16 as the internal format.

I don't want to cut off the road to UCS-4, I just want to make
clear that UTF-16 is a good choice and one which will last at
least some more years. We can then always decide to move on
to UCS-4 for the internal storage format.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/