[I18n-sig] Re: How does Python Unicode treat surrogates?

Gaute B Strokkenes gs234@cam.ac.uk
26 Jun 2001 04:24:27 +0100


On Mon, 25 Jun 2001, guido@digicool.com wrote:

>> No problem... we can change to 4 byte values too if the world
>> agrees on 4 bytes per character. However, 2 bytes or 4 bytes
>> is an implementation detail and not part of the Unicode standard
>> itself.
> 
> But UTF-16 vs. UCS-4 is not an implementation detail!

Sure it is!  A given chunk of Unicode data is semantically just a
finite sequence of Unicode scalar values.  The difference between
UTF-16 and UCS-4 is entirely one of how you arrange bits and
bytes to store the same information.  The meaning is exactly the same,
so it's an implementation detail.
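
To make that concrete, here is a minimal sketch.  It assumes a Python
where a string is a sequence of code points, and it uses the
"utf-32-be" codec as a stand-in for UCS-4; the sample string is just an
illustration of mine.  The same two scalar values go into two
different byte layouts and come back out unchanged:

```python
# Sample text: one BMP character and one character outside the BMP (U+1D11E).
text = "A\U0001D11E"

# The abstract content: a finite sequence of Unicode scalar values.
scalars = [ord(ch) for ch in text]

# Two different storage layouts for the same content.
utf16 = text.encode("utf-16-be")  # 2 bytes for "A", 4 bytes (a surrogate pair) for U+1D11E
utf32 = text.encode("utf-32-be")  # 4 bytes for each character

# Decoding either layout recovers the same scalar values.
from_utf16 = [ord(ch) for ch in utf16.decode("utf-16-be")]
from_utf32 = [ord(ch) for ch in utf32.decode("utf-32-be")]

print(scalars == from_utf16 == from_utf32)
```

The byte strings differ in length and content, but the decoded
sequences compare equal, which is all I mean by "the same information".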

A (somewhat far-fetched, but there you are) analogy is this: imagine
that you wish to store a true-colour bitmap in memory.  You could do
this by, say, storing the R, G and B components of a given pixel right
next to each other, in that order.  Alternatively, you could keep all
the R components in one chunk and all the G components in another, or
you could store the pixels in a different order.  All of this makes no
difference to the actual bitmap itself.

I hope you see what I mean.

> If we store 4 bytes per character, we should treat surrogates
> differently.  I don't know where those would be converted --
> probably in the UTF-16 to UCS-4 codec.

An important point here is that the sole raison d'être of surrogates
is to enable one to store the entire 21-bit Unicode character set
within the confines of a 16-bit encoding.  If you're not dealing with
UTF-16, surrogates quite simply do not exist, and the only time you
have to worry about them is when you convert to or from UTF-16.  As
such, the statement "we should treat surrogates differently when
storing four bytes per character" is rather imprecise: the whole point
is that you don't have to treat or worry about surrogates at all,
except during conversion to and from UTF-16, obviously.
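
For what it's worth, the conversion itself is only a little arithmetic.
The sketch below is not taken from any codec; the function names
to_surrogates and from_surrogates are hypothetical, and it only follows
the standard UTF-16 rule of subtracting 0x10000 and splitting the
remaining 20 bits into two 10-bit halves:

```python
def to_surrogates(scalar):
    """Split a scalar value above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are stored as surrogate pairs")
    offset = scalar - 0x10000        # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)   # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1D11E)
print(hex(high), hex(low))               # 0xd834 0xdd1e
print(hex(from_surrogates(high, low)))   # 0x1d11e
```

Nothing outside these two functions needs to know that surrogates
exist, which is why I'd expect them to live in the UTF-16 codec and
nowhere else.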

-- 
Big Gaute                               http://www.srcf.ucam.org/~gs234/
I have nostalgia for the late Sixties!  In 1969 I left my laundry with
 a hippie!!  During an unauthorized Tupperware party it was chopped &
 diced!