Newbie question about text encoding

Marko Rauhamaa marko at pacujo.net
Sun Mar 8 03:20:33 EDT 2015


Steven D'Aprano <steve+comp.lang.python at pearwood.info>:

> For those cases where you do wish to take an arbitrary byte stream and
> round-trip it, Python now provides an error handler for that.
>
> py> import random
> py> b = bytes([random.randint(0, 255) for _ in range(10000)])
> py> s = b.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0:
> invalid start byte
> py> s = b.decode('utf-8', errors='surrogateescape')
> py> s.encode('utf-8', errors='surrogateescape') == b
> True

That is indeed a valid workaround. With it we achieve

   b.decode('utf-8', errors='surrogateescape'). \
       encode('utf-8', errors='surrogateescape') == b

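for any bytes object b. That goes a long way toward addressing the Linux
programmer's situation, where file names and similar operating system
strings are essentially arbitrary byte sequences.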
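
The round-trip identity can be checked exhaustively for short inputs. The
following is only a sketch; the helper name roundtrip is made up for
illustration and is not part of any library.

```python
def roundtrip(b: bytes) -> bytes:
    """Decode with surrogateescape, then encode back the same way."""
    s = b.decode('utf-8', errors='surrogateescape')
    return s.encode('utf-8', errors='surrogateescape')

# Exhaustive check over all 256 one-byte and 65536 two-byte inputs.
one_byte_ok = all(roundtrip(bytes([i])) == bytes([i]) for i in range(256))
two_byte_ok = all(
    roundtrip(bytes([i, j])) == bytes([i, j])
    for i in range(256)
    for j in range(256)
)
print(one_byte_ok, two_byte_ok)   # True True

# An undecodable byte 0xNN is smuggled through as the lone surrogate U+DCNN.
print(ascii(b'\x94'.decode('utf-8', errors='surrogateescape')))   # '\udc94'
```

Valid UTF-8 sequences decode as usual; only the bytes that cannot be decoded
are mapped into the range U+DC80..U+DCFF.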

However,

 * it's not UTF-8 but a variant of it: the decoded string can contain lone
   surrogates, which a strict UTF-8 codec refuses to encode,

 * it sacrifices the ordering correspondence of UTF-8, under which comparing
   encoded bytes gives the same result as comparing code points:

   >>> '\udc80' > 'ä'
   True
   >>> '\udc80'.encode('utf-8', errors='surrogateescape') > \
   ...        'ä'.encode('utf-8', errors='surrogateescape')
   False

 * it still isn't bijective between str and bytes:

   >>> '\udd00'.encode('utf-8', errors='surrogateescape')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-8' codec can't encode character '\udd00' in position 0: surrogates not allowed
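
The three caveats can be probed in one script. This is a sketch for
illustration only: the helper name encodable is hypothetical, and the last
check (two different strings encoding to the same bytes) is an additional
observation along the same lines rather than something shown in the session
above.

```python
def encodable(ch: str) -> bool:
    """True if ch survives .encode('utf-8', errors='surrogateescape')."""
    try:
        ch.encode('utf-8', errors='surrogateescape')
    except UnicodeEncodeError:
        return False
    return True

# 1. Not UTF-8: a strict encoder rejects what the escaping decoder produced.
s = b'\x80'.decode('utf-8', errors='surrogateescape')   # '\udc80'
try:
    s.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 2. Ordering: str comparison and bytes comparison disagree.
a, b = '\udc80', '\xe4'            # '\xe4' is the letter a-umlaut
str_order = a > b
bytes_order = (a.encode('utf-8', errors='surrogateescape')
               > b.encode('utf-8', errors='surrogateescape'))

# 3. Not bijective: some code points have no encoding at all ...
unencodable = [cp for cp in range(0x110000) if not encodable(chr(cp))]
# ... and two different strings can encode to the same bytes.
collision = ('\udcc3\udca4'.encode('utf-8', errors='surrogateescape')
             == '\xe4'.encode('utf-8', errors='surrogateescape'))

print(strict_ok, str_order, bytes_order, len(unencodable), collision)
# False True False 1920 True
```

The 1920 failures are the surrogates U+D800..U+DFFF minus the 128 escape
code points U+DC80..U+DCFF; every other code point encodes normally.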


Marko


