PEP 393 vs UTF-8 Everywhere

Sun Jan 22 10:19:34 EST 2017

Steve D'Aprano <steve+python at pearwood.info>:

> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
>
>> Steve D'Aprano <steve+python at pearwood.info>:
>> 
>>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>>> Also, [surrogates] don't exist as Unicode code points. Python
>>>> shouldn't allow surrogate characters in strings.
>>>
>>> Not quite. This is where it gets a bit messy and confusing. The
>>> bottom line is: surrogates *are* code points, but they aren't
>>> *characters*.
>> 
>> All animals are equal, but some animals are more equal than others.
>
> Huh?

There is no difference between 0xD800 and 0xD8000000. They are both
numbers that don't--and won't--represent anything in Unicode. It's
pointless to call one a "code point" and not the other one. A code point
that isn't code for anything can barely be called a code point.
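
For reference, here is a minimal sketch of how CPython 3 itself treats the
two numbers. It only shows the interpreter's behaviour and does not settle
the terminology; the variable names are illustrative, and the exact
exception type raised for the out-of-range value is deliberately not
assumed.

```python
import unicodedata

# 0xD800 lies inside the Unicode code space (0..0x10FFFF), so chr()
# accepts it and returns a one-element str holding a lone surrogate.
lone = chr(0xD800)

# Its general category is "Cs" (Other, Surrogate): not an assigned character.
category = unicodedata.category(lone)

# 0xD8000000 lies far outside the code space, so chr() refuses it.
try:
    chr(0xD8000000)
    out_of_range_accepted = True
except (ValueError, OverflowError):
    out_of_range_accepted = False

print(len(lone), category, out_of_range_accepted)
```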

I'm guessing 0xD800 is called a code point because it has always been
called that. It was taken out of use when UTF-16 was invented, but they
didn't want to "demote" the number retroactively, especially since
Windows and Java were already allowing surrogates in strings.

>>> By the letter of the Unicode standard, [Python] should not do this,
>>> but nevertheless it does and it appears to do no real harm and have
>>> some benefit.
>> 
>> I'm afraid Python's choice may lead to exploitable security holes in
>> Python programs.
>
> Feel free to back up that with an actual demonstration of an exploit,
> rather than just FUD.

It might come as a surprise to programmers that some pathnames cannot be
encoded as UTF-8 or displayed. Also, such situations might not show up
during testing but only with appropriately crafted input.
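
A minimal sketch of the situation, assuming a filename whose bytes are not
valid UTF-8 and that is decoded with the "surrogateescape" error handler
(PEP 383). The filename is made up for illustration, and the codec is named
explicitly so the result does not depend on the platform's filesystem
encoding.

```python
# Bytes that are not valid UTF-8 (0xFF never occurs in UTF-8), standing in
# for a filename handed over by the operating system. Hypothetical name.
raw_name = b"report-\xff.txt"

# "surrogateescape" maps each undecodable byte 0xNN to the lone surrogate
# U+DCNN instead of raising an error.
name = raw_name.decode("utf-8", "surrogateescape")

# The resulting str cannot be encoded with the strict UTF-8 codec, which is
# what happens when it is written to a UTF-8 log, socket, or terminal.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Only the same error handler gets the original bytes back.
round_trip = name.encode("utf-8", "surrogateescape")
```

An all-ASCII test filename never exercises the failing branch, which is why
such a problem can go unnoticed until someone supplies a crafted name.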

Marko