PEP 393 vs UTF-8 Everywhere

Sun Jan 22 03:34:07 EST 2017

Steve D'Aprano <steve+python at pearwood.info>:

> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>> Also, [surrogates] don't exist as Unicode code points. Python
>> shouldn't allow surrogate characters in strings.
>
> Not quite. This is where it gets a bit messy and confusing. The bottom
> line is: surrogates *are* code points, but they aren't *characters*.

All animals are equal, but some animals are more equal than others.

> Strings which contain surrogates are strictly speaking illegal,
> although some programming languages (including Python) allow them.

Python shouldn't allow them.
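
For reference, here is a minimal sketch of what Python does today,
assuming CPython 3.x (the variable names are mine, for illustration only).
The literal is accepted without complaint, and the problem only surfaces
when something tries to encode the string:

```python
# A lone low surrogate is accepted as a str literal in Python 3.
low = '\udc37'
assert len(low) == 1
assert 0xD800 <= ord(low) <= 0xDFFF  # inside the surrogate range

# The strict UTF-8 codec refuses to encode it, so the failure is
# deferred to whatever code eventually writes the string out.
try:
    low.encode('utf-8')
    encoded_ok = True
except UnicodeEncodeError:
    encoded_ok = False
```

So the string is valid as far as str is concerned, but it cannot leave
the program as strict UTF-8.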

> The Unicode standard defines surrogates as follows:
> [...]
>
> - Surrogate Code Point. A Unicode code point in the range
>   U+D800..U+DFFF. Reserved for use by UTF-16,

The writer of the standard is playing word games, maybe to offer a fig
leaf to Windows, Java et al.

> By the letter of the Unicode standard, [Python] should not do this,
> but nevertheless it does and it appears to do no real harm and have
> some benefit.

I'm afraid Python's choice may lead to exploitable security holes in
Python programs.

>>> py> low = '\uDC37'
>> 
>> That should raise a SyntaxError exception.
>
> If Python was strictly conforming, that is correct, but it turns out
> there are some useful things you can do with strings if you allow
> surrogates.

Conceptual confusion is a high price to pay for such tricks.
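
I assume the tricks in question are things like the "surrogateescape"
error handler from PEP 383; the quoted text doesn't name them, so that is
my guess. A minimal sketch of that round trip, with made-up input bytes:

```python
# Bytes that are not valid UTF-8 (0xB7 is a stray continuation byte).
raw = b'abc\xb7'

# surrogateescape maps each undecodable byte 0x80..0xFF to a lone
# surrogate in U+DC80..U+DCFF instead of raising an error.
text = raw.decode('utf-8', errors='surrogateescape')

# Encoding with the same handler restores the original bytes.
round_trip = text.encode('utf-8', errors='surrogateescape')
```

The str in the middle is exactly the kind of surrogate-carrying string
under discussion.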

Marko