[I18n-sig] Re: Unicode surrogates: just say no!

Guido van Rossum guido@digicool.com
Wed, 27 Jun 2001 13:44:20 -0400


> The issue of UTF-8 encoded surrogate pairs is clear now to me, I hope:
> You must not write them, but you may read them.

Agreed.  Clarifying: if you read one pair when converting to UCS-4,
you should store one character; when converting to UCS-2, you should
store a pair, of course.
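
A minimal sketch of that rule, for illustration only. The helper names
decode_triplet and combine are hypothetical and not part of any codec
API; the input is assumed to be a surrogate pair written as two
three-byte UTF-8 sequences, and no validation is done.

```python
def decode_triplet(b):
    """Decode one 3-byte UTF-8 sequence into a 16-bit value (no validation)."""
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def combine(high, low):
    """Combine a high and a low surrogate into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+D800 and U+DC00, each encoded as a UTF-8 triplet.
data = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])

high = decode_triplet(data[0:3])
low = decode_triplet(data[3:6])

ucs2 = [high, low]           # UCS-2 target: store the pair
ucs4 = [combine(high, low)]  # UCS-4 target: store one character
```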

> The next question then is what to do with lone surrogate triplets; the
> table in TR 27 suggests they are legal, but people on this list have
> argued they must neither be emitted nor consumed (since what you get
> is not a legal USV).

I see two positions possible:

(1) it's up to the application to ensure this, not to the codec, so
    the codec needn't check for this;

(2) the codec's output should be legal, and this is a good time to
    check for illegalities.

Since both are reasonable positions, perhaps the error handling option
of the codec should be used to decide?

None of "strict", "replace", or "ignore" really matches the
semantics of (1), however; perhaps this behavior should be called
"lenient".

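To make the idea concrete, here is a rough sketch of how an error
handling option could choose between the two positions. Everything in
it is an assumption for illustration: to_code_points is a hypothetical
function, not an existing codec, and "lenient" is only the name
proposed above, not an existing error handling value. It converts
16-bit units to code points and consults the option only when it meets
a lone surrogate.

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def to_code_points(units, errors="strict"):
    """Convert 16-bit units to code points; 'errors' governs lone surrogates."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        # A well-formed pair always becomes a single code point.
        if is_high(u) and i + 1 < len(units) and is_low(units[i + 1]):
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
            continue
        if is_high(u) or is_low(u):
            # Lone surrogate: the error handling option decides.
            if errors == "strict":      # position (2): refuse
                raise ValueError("lone surrogate %#06x at index %d" % (u, i))
            elif errors == "replace":   # substitute U+FFFD
                out.append(0xFFFD)
            elif errors == "ignore":    # drop it
                pass
            elif errors == "lenient":   # position (1): pass it through
                out.append(u)
            else:
                raise LookupError("unknown error handling: %r" % (errors,))
        else:
            out.append(u)
        i += 1
    return out
```

With this sketch, to_code_points([0x41, 0xD800], "lenient") passes the
lone surrogate through unchanged, while the default "strict" raises
ValueError on the same input.
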
--Guido van Rossum (home page: http://www.python.org/~guido/)