PEP 393 vs UTF-8 Everywhere

Sat Jan 21 10:50:40 EST 2017

Steve D'Aprano <steve+python at pearwood.info> writes:

> [...]
> Another factor which I didn't see discussed anywhere is that Python
> strings treat surrogates as normal code points. I believe that would
> be troublesome for a UTF-8 implementation:
>
> py> '\uDC37'.encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
> position 0: surrogates not allowed
>
> but of course with a UCS-2 or UTF-32 implementation it is trivial: you
> just treat the surrogate as another code point like any other.

Thanks for a very thorough reply, most useful. I'm going to pick you up
on the above, though.

Surrogates only have meaning in UTF-16. They are expressly forbidden in
well-formed UTF-8 and UTF-32. The rules for UTF-8 were tightened up in
Unicode 4 and RFC 3629 (2003). If you really need a naive transcoding of
UTF-16 code units into something UTF-8-like, there is CESU-8.
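
A rough sketch of that point, assuming Python 3.4 or later (where the
strict UTF-16/UTF-32 encoders also refuse surrogates). The helper names
rejects_surrogate and cesu8_encode are made up for illustration; neither
is a standard library function, and the CESU-8 part is only a hand-rolled
approximation of the scheme, not a vetted codec:

```python
def rejects_surrogate(codec):
    """Return True if the strict encoder for `codec` refuses a lone surrogate."""
    try:
        '\udc37'.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


def cesu8_encode(text):
    """Hypothetical helper: encode `text` in a CESU-8-like way.

    Characters outside the BMP are first split into a UTF-16 surrogate
    pair, and each surrogate is then written as a three-byte UTF-8-style
    sequence. BMP characters are encoded as ordinary strict UTF-8, so a
    lone surrogate in the input still raises UnicodeEncodeError.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)    # lead (high) surrogate
            low = 0xDC00 + (cp & 0x3FF)   # trail (low) surrogate
            for unit in (high, low):
                # 'surrogatepass' lets the UTF-8 encoder emit the surrogate
                out += chr(unit).encode('utf-8', 'surrogatepass')
        else:
            out += ch.encode('utf-8')
    return bytes(out)
```

For U+1F600, real UTF-8 is the four bytes F0 9F 98 80, whereas
cesu8_encode('\U0001F600') gives the six bytes ED A0 BD ED B8 80.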

py> low = '\uDC37'

is only meaningful on narrow builds before Python 3.3, where the user
must do extra work to handle characters outside the BMP correctly.
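
To illustrate what a narrow build exposed, here is a small sketch that
lists the UTF-16 code units of a string. The function name utf16_units
is my own invention for this example. On Python 3.3 and later a non-BMP
character is a single item in the str, but its UTF-16 form is still the
surrogate pair that narrow builds used to make visible as two items:

```python
def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode('utf-16-be')
    return [int.from_bytes(data[i:i + 2], 'big')
            for i in range(0, len(data), 2)]


smiley = '\U0001F600'            # a character outside the BMP
units = utf16_units(smiley)      # high surrogate, then low surrogate
```

Here len(smiley) is 1 on Python 3.3+, while units holds the two values
0xD83D and 0xDE00.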

-- 
Pete Forman
https://payg-petef.rhcloud.com