[I18n-sig] Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!)

Gaute B Strokkenes gs234@cam.ac.uk
27 Jun 2001 13:38:33 +0100


On Wed, 27 Jun 2001, JMachin@Colonial.com.au wrote:
> 
> [earlier correspondents]
>>> Personally, I think that the codecs should report an error in the
>>> appropriate fashion when presented with a python unicode string
>>> which contains values that are not allowed, such as lone
>>> surrogates.
>> 
>> Other people have read Unicode 3.1 and come to the conclusion that
>> it mandates that implementations accept such a character...
> 
> [big Gaute]
> Well, they're wrong.  The standard is clear as ink in this regard.
> 
> [my comment]
> Unfortunately ink is usually opaque :-)

Precisely.  That's standardese for you.  8-)

> The problem is caused by section 3.8 in Unicode 3.0, which is not
> specifically amended by 3.1 as far as I can tell.

It's not; AFAIK the list of changes at
<http://www.unicode.org/unicode/reports/tr27/> is supposed to be
canonical and it's not listed.

> The offending text occurs after clause D29. It says "... every UTF
> supports lossless round-trip transcoding ..." and "... a UTF mapping
> must also map invalid Unicode scalar values to unique code value
> sequences. These invalid scalar values include [0xFFFE], [0xFFFF]
> and unpaired surrogates."

Sigh.  This means that the Unicode standard is self-contradicting.

It is nowhere defined precisely what "invalid Unicode Scalar Value"
means.  I can only assume that it means "an integer in the range 0 -
0x10FFFF that is not a Unicode Scalar Value".  Even so, the statement
is just plain wrong as far as UTF-16 is concerned.  If UTF-16 is
supposed to define a bijective mapping from sequences of integers in
the range 0 - 0x10FFFF to sequences of integers in the range 0 -
0xFFFF (and this is definitely what the statement is saying), then we
have a contradiction: suppose that H is some high surrogate value
and that L is some low surrogate value, and that U is the
corresponding USV.  Then the sequences

  H, L    <-- sequence consisting of two "invalid USVs"

and

  U       <-- sequence consisting of a single (valid) USV

both map to

  H, L    <-- sequence of two UTF-16 code units

under UTF-16, so that the mapping induced by UTF-16 is very definitely
not bijective.
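
To make the collision concrete, here is a small Python sketch (the
hand-rolled encoder below is mine, not anything from the thread); it
encodes both sequences with the usual surrogate-pair arithmetic and
shows that they land on the same code units:

  def utf16_code_units(scalars):
      # Naive UTF-16 encoder which, as the text after D29 seems to
      # require, passes lone surrogates through as single code units.
      units = []
      for u in scalars:
          if u < 0x10000:
              units.append(u)                       # BMP value, surrogates included
          else:
              v = u - 0x10000
              units.append(0xD800 + (v >> 10))      # high surrogate
              units.append(0xDC00 + (v & 0x3FF))    # low surrogate
      return units

  H, L = 0xD800, 0xDC00                              # a high/low surrogate pair
  U = 0x10000 + ((H - 0xD800) << 10) + (L - 0xDC00)  # the corresponding USV

  # [H, L] (two "invalid USVs") and [U] (one valid USV) produce the
  # same code unit sequence, so the mapping cannot be bijective.
  assert utf16_code_units([H, L]) == utf16_code_units([U]) == [0xD800, 0xDC00]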

I have no idea why the standard includes this apparent error, but my
best guess is that the claim used to be true back in the pre-3.1
days, when UTF-16 (though not under that name) was not merely one UTF
among several, but _the_ canonical Unicode encoding.  Note that the
statement given after D29 actually is true when applied to UTF-8 and
UTF-32.

However, let us put this annoying fact aside for a moment.  I believe
that D29 is intended to point out that the various UTFs will "just
work" if you try to encode scalar values that are not proper USVs.
This is not the same thing as saying that these invalid USVs or the
"pseudo-characters" or whatever that arise from them have any business
in a Unicode string.  In fact, Unicode conformant processes are
explicitly forbidden from interpreting or using U+FFFF or U+FFFE when
passing Unicode data between each other.  They are, however,
explicitly allowed and even encouraged to use these values internally
as sentinel or "fencepost" values.  To put this slightly differently,
a process may be storing some Unicode data internally and it may be
storing U+FFFF for some reason or another in that internal data.  The
process may then want to use a UTF to transform this data into a more
convenient form.  I think that the text after D29 is merely pointing out
that this is actually feasible, in spite of the appearance of invalid
USVs in the internal data.
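
For what it's worth, round-tripping such sentinel values through a UTF
does work in practice.  A tiny sketch (using present-day Python rather
than the interpreters discussed in this thread):

  # U+FFFF used purely as an internal fencepost value; the UTF-8 codec
  # does not reject it, so the internal data survives the round trip.
  internal = u"abc\uffffdef"
  assert internal.encode("utf-8").decode("utf-8") == internal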

I would be indebted if any of the experts who hang out on the unicode
list could sort out this confusion.

> My interpretation of this is that the 2nd part I quoted says we must
> export the guff, and the 1st part says we must accept it back again.
> 
> I don't particularly like this idea, and am not in favour of codecs
> silently accepting such guff in incoming data --- I'm just pointing out
> that this "lossless round-trip transcoding" concept seems to be at
> variance with various interpretations of what is "legal".

Yup.

My take on this is that the various UTF codecs should follow the specs
to the letter and reject anything else in default mode.  There should
also be a "lenient" or "forgiving" mode in which the codec does its
best to interpret and repair broken, nonsensical or irregular data.
Of course, if an application uses this mode then it will have to be
aware of the dangers involved, including the security aspects.
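
Something along the lines of that split exists in today's Python codec
machinery; the error handlers below postdate this thread (in
particular "surrogatepass"), so take this only as an illustration of
the idea:

  bogus = b"\xed\xa0\x80"    # UTF-8-style byte sequence for a lone surrogate

  try:
      bogus.decode("utf-8")                      # strict default: rejected
  except UnicodeDecodeError:
      pass

  bogus.decode("utf-8", "replace")               # lenient: replacement characters
  bogus.decode("utf-8", "surrogatepass")         # permissive: yields u'\ud800'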

-- 
Big Gaute                               http://www.srcf.ucam.org/~gs234/
I'm having BEAUTIFUL THOUGHTS about the INSIPID WIVES
 of smug and wealthy CORPORATE LAWYERS..