PEP 393 vs UTF-8 Everywhere

eryk sun eryksun at gmail.com
Sat Jan 21 15:49:26 EST 2017


On Sat, Jan 21, 2017 at 8:21 PM, Pete Forman <petef4+usenet at gmail.com> wrote:
> Marko Rauhamaa <marko at pacujo.net> writes:
>
>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
>
> Quite. My point was that with older Python on a narrow build (Windows
> and Mac) you need to understand that you are using UTF-16 rather than
> Unicode. On a wide build or Python 3.3+ then all is rosy. (At this point
> I'm tempted to put in a winky emoji but that might push the internal
> representation into UCS-4.)

CPython allows surrogate codes for use with the "surrogateescape" and
"surrogatepass" error handlers, which are used for POSIX and Windows
file-system encoding, respectively. Maybe MicroPython goes about the
file-system round-trip problem differently, or maybe it just requires
using bytes for file-system and environment-variable names on POSIX
and doesn't care about Windows.

"surrogateescape" allows 'decoding' arbitrary bytes:

    >>> b'\x81'.decode('ascii', 'surrogateescape')
    '\udc81'
    >>> '\udc81'.encode('ascii', 'surrogateescape')
    b'\x81'
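
As a rough illustration of why this round-trips, the sketch below pushes
all 256 byte values through the ASCII codec with "surrogateescape". The
helper name escape_roundtrip is made up for this example; only the codec
calls are standard.

```python
def escape_roundtrip(raw: bytes, encoding: str = "ascii") -> bytes:
    """Decode with surrogateescape, then encode back the same way."""
    text = raw.decode(encoding, "surrogateescape")
    return text.encode(encoding, "surrogateescape")

all_bytes = bytes(range(256))
decoded = all_bytes.decode("ascii", "surrogateescape")

# Code points produced for the 128 byte values that ASCII cannot decode.
escaped = [ord(ch) for ch in decoded[128:]]

# True if every byte value survives the decode/encode round trip.
roundtrip_ok = escape_roundtrip(all_bytes) == all_bytes
```

Each undecodable byte 0xNN comes back as the lone low surrogate U+DCNN,
so the escaped values run from U+DC80 through U+DCFF.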

This error handler is required by CPython on POSIX to handle arbitrary
bytes in file-system paths. For example, when running with LANG=C:

    >>> sys.getfilesystemencoding()
    'ascii'
    >>> os.listdir(b'.')
    [b'\x81']
    >>> os.listdir('.')
    ['\udc81']
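
The same conversion is available through os.fsdecode() and
os.fsencode(). A minimal sketch, assuming a POSIX-style configuration
where the file-system error handler is "surrogateescape"; the helper
fs_roundtrip is hypothetical, and on other configurations the check is
simply skipped:

```python
import os
import sys

def fs_roundtrip(name: bytes) -> bytes:
    """Convert a bytes file name to str and back via the fs encoding."""
    return os.fsencode(os.fsdecode(name))

# Error handler CPython pairs with the file-system encoding (3.6+).
handler = sys.getfilesystemencodeerrors()

if handler == "surrogateescape":
    # An undecodable byte is carried through str as a lone surrogate.
    result = fs_roundtrip(b"\x81")
else:
    # Other handlers (e.g. on Windows) are outside this sketch.
    result = None
```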

"surrogatepass" allows encoding surrogates:

    >>> '\udc81'.encode('utf-8', 'surrogatepass')
    b'\xed\xb2\x81'
    >>> b'\xed\xb2\x81'.decode('utf-8', 'surrogatepass')
    '\udc81'
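
For comparison, the strict UTF-8 codec rejects a lone surrogate, and the
bytes that "surrogatepass" emits are just the ordinary three-byte UTF-8
bit layout applied to the surrogate code point. A small sketch; the
helpers three_byte_form and strict_rejects are illustrative names, not
part of any library:

```python
def three_byte_form(code_point: int) -> bytes:
    """Apply the 1110xxxx 10xxxxxx 10xxxxxx layout to a code point."""
    if not 0x800 <= code_point <= 0xFFFF:
        raise ValueError("not in the three-byte range")
    return bytes([
        0xE0 | (code_point >> 12),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])

def strict_rejects(text: str) -> bool:
    """Return True if the default (strict) UTF-8 encoder refuses text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False

lone = "\udc81"
by_hand = three_byte_form(ord(lone))
by_codec = lone.encode("utf-8", "surrogatepass")
```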

This error handler is used by CPython 3.6+ to encode Windows UCS-2
file-system paths as WTF-8 (Wobbly). For example:

    >>> os.listdir('.')
    ['\udc81']
    >>> os.listdir(b'.')
    [b'\xed\xb2\x81']
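
The conversion can be imitated on any platform with the codecs alone.
This is only an emulation under the assumption that a Windows path is a
sequence of 16-bit units stored little-endian; it is not the code path
CPython itself uses, and both helper names are invented here:

```python
def wide_to_wobbly(wide: bytes) -> bytes:
    """UTF-16-LE units (possibly ill-formed) to WTF-8-style bytes."""
    text = wide.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")

def wobbly_to_wide(wobbly: bytes) -> bytes:
    """WTF-8-style bytes back to UTF-16-LE units."""
    text = wobbly.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")

# A lone U+DC81 as a single little-endian 16-bit unit.
wide_name = b"\x81\xdc"
wobbly_name = wide_to_wobbly(wide_name)
```

A well-formed surrogate pair passed through the same helper comes out as
regular four-byte UTF-8, so only unpaired surrogates get the special
three-byte form.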


