PEP 393 vs UTF-8 Everywhere

Sun Jan 22 21:14:20 EST 2017

On Mon, 23 Jan 2017 02:19 am, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+python at pearwood.info>:
> 
>> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
>>
>>> Steve D'Aprano <steve+python at pearwood.info>:
>>> 
>>>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>>>> Also, [surrogates] don't exist as Unicode code points. Python
>>>>> shouldn't allow surrogate characters in strings.
>>>>
>>>> Not quite. This is where it gets a bit messy and confusing. The
>>>> bottom line is: surrogates *are* code points, but they aren't
>>>> *characters*.
>>> 
>>> All animals are equal, but some animals are more equal than others.
>>
>> Huh?
> 
> There is no difference between 0xD800 and 0xD8000000. 

Arithmetic disagrees:

py> 0xD800 == 0xD8000000
False

> They are both 
> numbers that don't--and won't--represent anything in Unicode.

Your use of hex notation 0x... indicates that you're talking about code
units rather than U+... code points. The first one 0xD800 could be:

- a Little Endian double-byte code unit for 'Ø' in either UCS-2 or UTF-16;

- a Big Endian double-byte code unit that has no special meaning in UCS-2;

- one half of a surrogate pair (two double-byte code units) in Big Endian
  UTF-16, encoding some unknown supplementary code point.

The second one 0xD8000000 could be:

- a four-byte unsigned integer (say, a C uint32_t) 3623878656, which is
  out of range for Big Endian UCS-4 or UTF-32;

- the Little Endian four-byte code unit for 'Ø' in either UCS-4 or UTF-32.
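
Those readings can be checked from Python. This is only a rough sketch:
Python has no UCS-2 or UCS-4 codec, so the UTF-16 and UTF-32 codecs stand
in for them, and try_decode is a helper name I made up for illustration.

```python
# The same bytes, read under different encodings and byte orders.
raw16 = bytes.fromhex("d800")
raw32 = bytes.fromhex("d8000000")


def try_decode(data, codec):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


le16 = try_decode(raw16, "utf-16-le")   # 'Ø' (U+00D8)
be16 = try_decode(raw16, "utf-16-be")   # None: half a surrogate pair
le32 = try_decode(raw32, "utf-32-le")   # 'Ø' (U+00D8)
be32 = try_decode(raw32, "utf-32-be")   # None: beyond U+10FFFF

# Read as a plain big-endian unsigned integer instead of text.
as_int = int.from_bytes(raw32, "big")   # 3623878656
```

The bytes themselves say nothing about which reading is intended; the
encoding and byte order have to come from somewhere else.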

> It's pointless to call one a "code point" and not the other one. 

Neither of them is a code point. You're confusing the concrete
representation with the abstract character.

Perhaps you meant to compare the code point U+D800 to, well, there's no
comparison to be made, because "U+D8000000" is not valid and is completely
out of range. The largest code point is U+10FFFF.
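
For what it's worth, here is how that looks from Python 3.3 or later (a
sketch only; the variable names are mine). The interpreter accepts U+D800
as a code point, classifies it as a surrogate, and refuses anything past
U+10FFFF.

```python
import sys
import unicodedata

# The largest code point a Python 3.3+ string can hold.
largest = sys.maxunicode                    # 0x10FFFF

# U+D800 is accepted as a code point; its general category is
# "Cs" (surrogate), not any letter or symbol category.
surrogate = chr(0xD800)
category = unicodedata.category(surrogate)  # 'Cs'

# Anything past U+10FFFF is rejected outright. Which exception is
# raised depends on how large the number is, so catch both.
try:
    chr(0xD8000000)
    out_of_range_rejected = False
except (ValueError, OverflowError):
    out_of_range_rejected = True
```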

> A code point 
> that isn't code for anything can barely be called a code point.

It does have a purpose. Or even more than one.

- It ensures that there is a one-to-one mapping between code points and
  code units in any specific encoding and byte-order.

- By reserving those code points, it ensures that they cannot be
  accidentally used by the standard for something else.

- It makes it easier to talk about the entities: "U+D800 is a surrogate 
  code point reserved for UTF-16 surrogates", as opposed to "U+D800 isn't
  anything, but if it was something, it would be a code point reserved 
  for UTF-16 surrogates".

- It saves us from the worse alternative of talking in terms of code
  units (implementation) instead of abstract characters, which is
  painfully verbose:

  "0xD800 in Big Endian UTF-16, or 0x00D8 in Little Endian UTF-16, or 
  0x0000D800 in Big Endian UTF-32, or 0x00D80000 in Little Endian 
  UTF-32, doesn't map to any code point but is reserved for UTF-16
  surrogate pairs."
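
The byte-level spellings of U+D800 can be generated rather than
remembered. A sketch, assuming Python 3.4 or later, where the
"surrogatepass" error handler also works with the UTF-16 and UTF-32
codecs; the names lone, forms and strict_ok are mine.

```python
# One abstract code point, four concrete byte sequences.
lone = "\ud800"

forms = {}
for codec in ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    # "surrogatepass" lets the codec emit the lone surrogate as-is.
    forms[codec] = lone.encode(codec, "surrogatepass").hex()

# forms == {"utf-16-be": "d800", "utf-16-le": "00d8",
#           "utf-32-be": "0000d800", "utf-32-le": "00d80000"}

# Without the handler, the codec refuses: a lone surrogate is a code
# point, but not something that can be legally encoded.
try:
    lone.encode("utf-16-be")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```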

And, an entirely unforeseen purpose:

- It allows languages like Python to (ab)use surrogate code points for
  round-tripping file names which aren't valid Unicode.
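
A minimal sketch of that round trip, using the codec calls directly so
the result does not depend on the platform's file system encoding. The
file name is made up.

```python
# A byte string that is not valid UTF-8 (0xFF never appears in UTF-8).
raw_name = b"report-\xff.txt"

# Decoding with "surrogateescape" smuggles the bad byte through as the
# lone surrogate U+DCFF instead of raising an error.
text = raw_name.decode("utf-8", "surrogateescape")   # 'report-\udcff.txt'

# Encoding with the same handler restores the original bytes exactly.
back = text.encode("utf-8", "surrogateescape")
round_trips = (back == raw_name)                     # True
```

On POSIX systems os.fsdecode() and os.fsencode() apply the same handler
with the file system encoding.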

[...]
>>> I'm afraid Python's choice may lead to exploitable security holes in
>>> Python programs.
>>
>> Feel free to back up that with an actual demonstration of an exploit,
>> rather than just FUD.
> 
> It might come as a surprise to programmers that pathnames cannot be
> UTF-encoded or displayed. 

Many things come as surprises to programmers, and many pathnames cannot be
UTF-encoded.

To be precise, Mac OS requires pathnames to be both valid and normalised
UTF-8, and it would be nice if that practice spread. But Windows only
requires pathnames to consist of UCS-2 code units, and Linux pathnames are
arbitrary bytes that may include characters which are illegal on Windows.
So you don't need to involve surrogates to have undecodable pathnames.
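
To make that concrete, here is a sketch with a made-up Latin-1 file name
of the kind a Linux file system will happily store. The variable names
are mine.

```python
# A legal Linux pathname: Latin-1 bytes, not valid UTF-8.
raw = b"caf\xe9.txt"

# A strict decode fails with no surrogates in sight.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# Python's escape hatch turns the bad byte into a lone surrogate...
escaped = raw.decode("utf-8", "surrogateescape")     # 'caf\udce9.txt'

# ...which in turn cannot be strictly encoded as UTF-8.
try:
    escaped.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# For display or logging, "backslashreplace" gives a lossy but safe form.
printable = escaped.encode("ascii", "backslashreplace").decode("ascii")
```

So the failure is loud: an exception at the point of encoding, not silent
corruption.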

> Also, those situations might not show up 
> during testing but only with appropriately crafted input.

I'm not seeing a security exploit here.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.