Encoding of surrogate code points to UTF-8

Tue Oct 8 21:28:25 EDT 2013

On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
> On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:
>
>> In any case, "\ud800\udc01" isn't a valid unicode string.
>
> I don't think this is correct. Can you show me where the standard says
> that Unicode strings[1] may not contain surrogates? I think that is a

See below.

> critical point, and the FAQ conflates *encoded strings* (i.e. bytes using
> one of the UTFs) with *Unicode strings*.
>
> The string you give above is a Unicode string containing two code
> points, the surrogates U+D800 U+DC01, which as far as I am concerned is a
> legal string (subject to somebody pointing me to a definitive source that
> proves it is not). However, it *may or may not* be encodable to bytes
> using UTF-8, -16 or -32.

From chapter two of the standard:

"Plain text is a pure sequence of character codes; plain Unicode-encoded 
text is therefore a sequence of Unicode character codes."

http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708
"All three encoding forms can be used to represent the full range of 
encoded characters in the Unicode Standard; ... Each of the three 
Unicode encoding forms can be efficiently transformed into either of
the other two without any loss of data."
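
The lossless transformation claimed in that passage can be tried out in Python 3. This is only a sketch: the sample string is an arbitrary choice of mine, and I use the explicit little-endian codec names so that no BOM gets in the way of the comparison.

```python
# Round-trip one string through all three Unicode encoding forms.
# The sample text is arbitrary: it needs 1, 2, 3 and 4 bytes per
# character in UTF-8 respectively.
text = "A\u00e9\u20ac\U0001d11e"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # explicit byte order, so no BOM
utf32 = text.encode("utf-32-le")

# Transform UTF-8 -> UTF-16 -> UTF-32 -> UTF-8, each step starting
# from the previous encoded form rather than from the original str.
via_utf16 = utf8.decode("utf-8").encode("utf-16-le")
via_utf32 = via_utf16.decode("utf-16-le").encode("utf-32-le")
back_to_utf8 = via_utf32.decode("utf-32-le").encode("utf-8")

# Nothing was lost on the way round.
assert via_utf16 == utf16
assert via_utf32 == utf32
assert back_to_utf8 == utf8

print(len(utf8), len(utf16), len(utf32))
```

The three forms differ in size for the same text, but each one decodes back to the same four characters.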

"Surrogates Area. The Surrogates Area contains only surrogate code 
points and no encoded characters. See Section 16.6, Surrogates Area, for 
more detail."

Before UTF-16, the surrogates area was, I believe, part of the Private
Use Area (which now starts where the surrogates end). I think it would
have been better if surrogates were no longer called code points, but
simply UTF-16 code units.

> Just as there are byte sequences that cannot be generated by the UTFs,
> possibly there are code point sequences that cannot be converted to bytes
> using the UTFs.

True, but beside the point. You switched from sequences of characters
(Unicode text), which is what both Neil and I are talking about, to
sequences of code points, which is a larger set when you include the
non-character surrogate 'code points' that are not allowed in Unicode text.

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G7404

"The Unicode Standard supports three character encoding forms: UTF-32, 
UTF-16, and UTF-8. Each encoding form maps the Unicode code points 
U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."
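
The two ranges in that quote can be checked against Python's strict UTF-8 codec. A minimal sketch, assuming Python 3; the helper name `encodable` is my own, not something from the Standard or the stdlib.

```python
def encodable(code_point, encoding="utf-8"):
    """Return True if the strict codec accepts this single code point."""
    try:
        chr(code_point).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# The boundaries of the two ranges quoted from the Standard are accepted.
boundaries = [0x0000, 0xD7FF, 0xE000, 0x10FFFF]
assert all(encodable(cp) for cp in boundaries)

# Every code point in the gap between them (the surrogates) is rejected.
rejected = [cp for cp in range(0xD800, 0xE000) if not encodable(cp)]
assert len(rejected) == 2048

# The str type itself still holds Neil's example as two code points;
# it is only the encoding step that refuses it.
s = "\ud800\udc01"
assert len(s) == 2
assert not encodable(ord(s[0])) and not encodable(ord(s[1]))
```

So a Python str can hold the surrogates as code points, while the strict encoder treats them as outside the domain of the encoding form.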

> [1] Sequences of Unicode code points.

This is not the Standard's definition of 'unicode text'. It is also not 
its definition of 'unicode string'.

"D80 Unicode string: A code unit sequence containing code units of a 
particular Unicode encoding form."

In other words, a Unicode string is a UTF encoding of Unicode text. The
FSR adaptively uses a subset of possible sequences from all three,
though only one UTF is used for any particular string.
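
The adaptive storage is observable from Python, at least roughly. The sketch below assumes CPython 3.3 or later; `sys.getsizeof` reports implementation details, so the numbers are not guaranteed elsewhere, and the helper `bytes_per_code_point` is a name I made up for illustration.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point for a string made only of ch.

    Differencing two sizes cancels the fixed object header, leaving
    only the per-code-point cost of the internal representation.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

narrow = bytes_per_code_point("a")           # ASCII only
medium = bytes_per_code_point("\u20ac")      # needs 16 bits
wide = bytes_per_code_point("\U0001d11e")    # outside the BMP

print(narrow, medium, wide)
```

On the CPython builds I know of, this shows one width chosen per string, according to the largest code point it contains.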

--
D79 says what I claimed before: "The mapping of the set of Unicode 
scalar values to the set of code unit sequences for a Unicode encoding 
form is one-to-one."
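
One consequence of the one-to-one mapping is that a strict decoder must refuse alternative byte sequences for the same value. A small sketch, assuming Python 3's built-in 'utf-8' codec; `strict_utf8_decodes` is a hypothetical helper of mine.

```python
def strict_utf8_decodes(data):
    """Return True if the strict UTF-8 decoder accepts these bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# U+0000 has exactly one UTF-8 form; the two-byte 'overlong' form of
# the same value is rejected.
assert "\x00".encode("utf-8") == b"\x00"
assert strict_utf8_decodes(b"\x00")
assert not strict_utf8_decodes(b"\xc0\x80")

# The three bytes that would spell U+D800 are not well-formed UTF-8.
assert not strict_utf8_decodes(b"\xed\xa0\x80")

# Python only produces those bytes when explicitly asked to step
# outside the standard with the 'surrogatepass' error handler.
assert "\ud800".encode("utf-8", "surrogatepass") == b"\xed\xa0\x80"
```

The 'surrogatepass' handler is a Python escape hatch for round-tripping such data, not part of the encoding form itself.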

-- 
Terry Jan Reedy