[I18n-sig] Re: Unicode surrogates: just say no!

Gaute B Strokkenes gs234@cam.ac.uk
27 Jun 2001 00:52:17 +0100


On Tue, 26 Jun 2001, tree@basistech.com wrote:
> 
> UTF-8 can be used to encode each half of a surrogate pair
> (resulting in six bytes for the character) --- a proposal for this
> was presented by PeopleSoft at the UTC meeting last month. UTF-8 can
> also encode the code-point directly in four bytes.

This is wrong.  It is a bug to encode a non-BMP character with six
bytes by pretending that the two surrogate values used in its UTF-16
representation are BMP characters and encoding the character as though
it were a string consisting of those two values.  It is also a bug to
interpret such a six-byte sequence as a single character.  This was
clarified in Unicode 3.1.  There are several good reasons for this,
such as unique representation and security.
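To make the four-byte versus six-byte distinction concrete, here is a small sketch.  It relies on the behaviour of the strict "utf-8" codec in a current Python 3 interpreter (an assumption; it is not the codec under discussion in this thread), and the helper name strict_decode_fails is made up for illustration.

```python
# Assumes a current Python 3 interpreter; names here are illustrative only.
ch = "\U00010000"  # first code point outside the BMP

# Correct UTF-8: the code point is encoded directly in four bytes.
direct = ch.encode("utf-8")  # b'\xf0\x90\x80\x80'

# The six-byte form: each UTF-16 surrogate (D800, DC00) pushed through
# the three-byte UTF-8 bit layout as if it were a BMP character.
six_byte = b"\xed\xa0\x80\xed\xb0\x80"


def strict_decode_fails(data):
    """Return True if the strict UTF-8 decoder rejects `data`."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


print(len(direct), strict_decode_fails(direct))      # 4 False
print(len(six_byte), strict_decode_fails(six_byte))  # 6 True
```

The strict decoder accepts only the four-byte form, which is what keeps the representation unique.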

Personally, I think that the codecs should report an error in the
appropriate fashion when presented with a python unicode string which
contains values that are not allowed, such as lone surrogates.  While
it may be convenient to allow the python programmer to stick all kinds
of junk into a python unicode string it is not reasonable for the
python programmer to expect that this junk can be transformed into
something meaningful when he wants to encode it with some UTF or the
other.  This has the advantage that whenever I run something through a
codec the result is always a meaningful object of the appropriate
type.

For instance, I believe that given a python unicode string conversion
to UCS-2 should always fail if the string contains surrogates (lone or
otherwise) since UCS-2 is defined not to have surrogates.  Conversion
to UTF-16 or UTF-32 should fail whenever there is a lone surrogate,
and so on.  (These conditions are sufficient, but not necessary, for
such a conversion to fail.)
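A rough sketch of the policy described above.  The UTF-16 and UTF-32 checks use the standard codecs of a current Python 3 interpreter (an assumption about the reader's environment).  Python ships no UCS-2 codec, so encode_ucs2 is a hypothetical helper written only for this illustration, as is rejects.

```python
# Assumes a current Python 3 interpreter.  encode_ucs2() and rejects()
# are hypothetical helpers, not part of any library.

def encode_ucs2(text):
    """Strict UCS-2 (big-endian, no BOM): refuse anything needing surrogates."""
    for ch in text:
        cp = ord(ch)
        # A non-BMP code point would need a surrogate pair; a code point in
        # D800..DFFF is itself a surrogate.  UCS-2 has neither.
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("U+%04X cannot be represented in UCS-2" % cp)
    # For BMP-only, surrogate-free text UTF-16 and UCS-2 coincide.
    return text.encode("utf-16-be")


def rejects(text, encoding):
    """Return True if the strict codec refuses to encode `text`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return True
    return False


lone = "\ud800"       # lone high surrogate
pair = "\U00010000"   # non-BMP character; a surrogate pair in UTF-16

print(rejects(lone, "utf-16"), rejects(lone, "utf-32"))  # True True
print(rejects(pair, "utf-16"), rejects(pair, "utf-32"))  # False False
```

Calling encode_ucs2(pair) or encode_ucs2(lone) raises ValueError, while encode_ucs2("A") returns b"\x00A".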

Of course, it may be convenient to offer alternative codecs and
variations of existing ones that have a more lenient policy for use
when the programmer so wishes, for instance to interact with buggy
implementations.  However, this should not be the default.
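One way such an opt-in lenient mode can look, sketched with the "surrogatepass" error handler that current Python 3 provides (an assumption about the reader's environment; it did not exist when this was written).  The strict behaviour stays the default and the caller has to ask for the lenient one explicitly.

```python
# Assumes a current Python 3 interpreter with the "surrogatepass" handler.
junk = "a\ud800b"  # contains a lone high surrogate

# Default (strict) policy: the codec refuses the string.
try:
    junk.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Lenient policy, requested explicitly by the caller.
lenient = junk.encode("utf-8", errors="surrogatepass")
round_trip = lenient.decode("utf-8", errors="surrogatepass")

print(strict_ok)           # False
print(lenient)             # b'a\xed\xa0\x80b'
print(round_trip == junk)  # True
```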

Is the proposal you're referring to the "UTF-8s" proposal by Oracle
et al.?  This was brought up on the unicode list some time ago and
met with massive negative response, along the lines of "oh my god, not
another UTF; we have too many already" and "it is broken to sort
unicode strings by looking at the words in the UTF-16 representation;
you should compare in code point order instead".  The second objection
goes to the reason UTF-8s was proposed: Oracle and certain other
database vendors have old and buggy unicode implementations that do not
sort UTF-16 strings in codepoint order, and they wanted UTF-8s so that
a traditional C strcmp() on a UTF-8s string would give the same result
as comparing the same string in UTF-16 representation word by word.
Note that UTF-8 already has the corresponding property for UCS-4 /
UTF-32; this was one of the design criteria of UTF-8.  Essentially,
Oracle & co. want their old mistakes canonised.
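The ordering difference can be seen with two code points, one in the BMP above the surrogate range and one outside the BMP.  This is a sketch assuming a current Python 3 interpreter, where str comparison is by code point and bytes comparison is bytewise, like strcmp() in the absence of NUL bytes.  The six-byte value stands in for what a UTF-8s-style encoding would produce and is written out by hand, not generated by any real UTF-8s codec.

```python
# Assumes a current Python 3 interpreter; variable names are illustrative.
a = "\ue000"      # BMP code point above the surrogate range
b = "\U00010000"  # first non-BMP code point

# Code point order: U+E000 sorts before U+10000.
code_point_order = a < b

# UTF-8 byte order agrees with code point order.
utf8_order = a.encode("utf-8") < b.encode("utf-8")

# UTF-16 word-by-word order disagrees: b starts with D800, below E000.
# Big-endian bytes compare the same way as 16-bit words.
utf16_order = a.encode("utf-16-be") < b.encode("utf-16-be")

# A six-byte, UTF-8s-style form of b (surrogates D800 DC00 encoded
# separately) reproduces the UTF-16 ordering under bytewise comparison.
six_byte_b = b"\xed\xa0\x80\xed\xb0\x80"
utf8s_like_order = a.encode("utf-8") < six_byte_b

print(code_point_order, utf8_order)   # True True
print(utf16_order, utf8s_like_order)  # False False
```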

-- 
Big Gaute                               http://www.srcf.ucam.org/~gs234/
Did an Italian CRANE OPERATOR just experience uninhibited sensations
 in a MALIBU HOT TUB?