PEP 393 vs UTF-8 Everywhere

Sat Jan 21 21:36:47 EST 2017

On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:

> Pete Forman <petef4+usenet at gmail.com>:
> 
>> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
>> and UTF-32.
> 
> Also, they don't exist as Unicode code points. Python shouldn't allow
> surrogate characters in strings.

Not quite. This is where it gets a bit messy and confusing. The bottom line
is: surrogates *are* code points, but they aren't *characters*. Strings
which contain surrogates are strictly speaking illegal, although some
programming languages (including Python) allow them.

The Unicode standard defines surrogates as follows:

http://www.unicode.org/glossary/

- Surrogate Character. A misnomer. It would be an encoded character 
  having a surrogate code point, which is impossible. Do not use 
  this term.

- Surrogate Code Point. A Unicode code point in the range 
  U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of 
  surrogate code units (a high surrogate followed by a low surrogate)
  “stand in” for a supplementary code point.

- Surrogate Pair. A representation for a single abstract character
  that consists of a sequence of two 16-bit code units, where the
  first value of the pair is a high-surrogate code unit, and the
  second is a low-surrogate code unit. (See definition D75 in 
  Section 3.8, Surrogates.)

http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf#G2630

So far so good, this is clear: surrogates are not characters. Surrogate
pairs are only used by UTF-16 (since that's the only UTF which uses 16-bit
code units).

Suppose you read two code units (four bytes) in UTF-16 (big endian):

b'\xd8\x02\xdd\x00'

That could be ambiguous, as it could mean:

- a single code point, U+10900 PHOENICIAN LETTER ALF, encoded as
  the surrogate pair, U+D802 U+DD00;

- two surrogate code points, U+D802 followed by U+DD00.

UTF-16 definitely rejects the second alternative and categorically makes
only the first valid. To ensure that is the only valid interpretation, the
second is explicitly disallowed: conforming Unicode strings are not allowed
to include surrogate code points, regardless of whether you are intending
to encode them in UTF-32 (where there is no ambiguity) or UTF-16 (where
there is). Only UTF-16 encoded bytes are allowed to include surrogate code
*units*, and only in pairs.

However, Python (and other languages?) take a less strict approach. In
Python 3.3 and better, the code point U+10900 (Phoenician Alf) is encoded
using a single four-byte code unit: 0x00010900', so there's no ambiguity.
That allows Python to encode surrogates as double-byte code units with no
ambiguity: the interpreter can distinguish a single 32 byte number
0x00010900 from a pair of 16 bit numbers 0x0001 0x0900 and treat the first
as Alf and the second as two surrogates.

By the letter of the Unicode standard, it should not do this, but
nevertheless it does and it appears to do no real harm and have some
benefit.

The glossary also says:

- High-Surrogate Code Point. A Unicode code point in the range 
  U+D800 to U+DBFF. (See definition D71 in Section 3.8, Surrogates.)

- High-Surrogate Code Unit. A 16-bit code unit in the range D800_16 
  to DBFF_16, used in UTF-16 as the leading code unit of a surrogate
  pair. Also known as a leading surrogate. (See definition D72 in 
  Section 3.8, Surrogates.)

- Low-Surrogate Code Point. A Unicode code point in the range 
  U+DC00 to U+DFFF. (See definition D73 in Section 3.8, Surrogates.)

- Low-Surrogate Code Unit. A 16-bit code unit in the range DC00_16
  to DFFF_16, used in UTF-16 as the trailing code unit of a surrogate
  pair. Also known as a trailing surrogate. (See definition D74 in
  Section 3.8, Surrogates.)

So we can certainly talk about surrogates being code points: the code points
U+D800 through U+DFFF inclusive are surrogate code points, but not
characters.

They're not "non-characters" either. Unicode includes exactly 66 code points
formally defined as "non-characters":

- Noncharacter. A code point that is permanently reserved for internal
  use. Noncharacters consist of the values U+nFFFE and U+nFFFF (where
  n is from 0 to 1016), and the values U+FDD0..U+FDEF. See the FAQ on
  Private-Use Characters, Noncharacters and Sentinels.

http://www.unicode.org/faq/private_use.html#noncharacters

So even though noncharacters (with or without the hyphen) are code points
reserved for internal use, and surrogates are code points reserved for the
internal use of the UTF-16 encoder, and surrogates are not characters,
surrogates are not noncharacters.

Naming things is hard.

>    Thus the range of code points that are available for use as
>    characters is U+0000–U+D7FF and U+E000–U+10FFFF (1,112,064 code
>    points).
> 
>    <URL: https://en.wikipedia.org/wiki/Unicode>

That is correct for a strictly conforming Unicode implementation.

>> py> low = '\uDC37'
> 
> That should raise a SyntaxError exception.

If Python was strictly conforming, that is correct, but it turns out there
are some useful things you can do with strings if you allow surrogates.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.