Newbie question about text encoding
Marko Rauhamaa
marko at pacujo.net
Sun Mar 8 03:20:33 EDT 2015
Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
> For those cases where you do wish to take an arbitrary byte stream and
> round-trip it, Python now provides an error handler for that.
>
> py> import random
> py> b = bytes([random.randint(0, 255) for _ in range(10000)])
> py> s = b.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0:
> invalid start byte
> py> s = b.decode('utf-8', errors='surrogateescape')
> py> s.encode('utf-8', errors='surrogateescape') == b
> True
That is indeed a valid workaround. With it we achieve
    b.decode('utf-8', errors='surrogateescape'). \
        encode('utf-8', errors='surrogateescape') == b
for any bytes b. It goes a long way toward addressing the Linux
programmer's situation.
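As a rough check of that round-trip claim, here is a minimal sketch. The helper name roundtrip, the fixed seed, and the sample sizes are illustrative choices of mine, not anything from the thread. It also shows how an undecodable byte is carried inside the str.

```python
import random

def roundtrip(b: bytes) -> bytes:
    """Decode and re-encode with the surrogateescape error handler."""
    s = b.decode('utf-8', errors='surrogateescape')
    return s.encode('utf-8', errors='surrogateescape')

# Seeded generator so the run is repeatable; the sizes are arbitrary.
rng = random.Random(0)
samples = [bytes(rng.randrange(256) for _ in range(100)) for _ in range(200)]
all_round_trip = all(roundtrip(b) == b for b in samples)

# An undecodable byte 0xNN is carried as the lone surrogate U+DCNN.
escaped = b'\x94abc'.decode('utf-8', errors='surrogateescape')

# ascii() is used because printing a lone surrogate directly can itself
# fail on a strict UTF-8 stdout.
print(all_round_trip, ascii(escaped))
```

Random sampling only spot-checks the property; it does not prove it for all byte strings.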
However,
 * it's not UTF-8 but a variant of it,
 * it sacrifices the ordering correspondence of UTF-8:
       >>> '\udc80' > 'ä'
       True
       >>> '\udc80'.encode('utf-8', errors='surrogateescape') > \
       ... 'ä'.encode('utf-8', errors='surrogateescape')
       False
 * it still isn't bijective between str and bytes:
       >>> '\udd00'.encode('utf-8', errors='surrogateescape')
       Traceback (most recent call last):
         File "<stdin>", line 1, in <module>
       UnicodeEncodeError: 'utf-8' codec can't encode character
       '\udd00' in position 0: surrogates not allowed
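The last two points can be probed with a short sketch. The helper names encodable and sorts_alike and the sample strings are hypothetical, chosen only for illustration. It assumes CPython's behaviour that the handler encodes only the lone surrogates it can itself produce, and it compares code point order of str values against byte order of their encodings.

```python
def encodable(ch: str) -> bool:
    """True if ch survives encoding with the surrogateescape handler."""
    try:
        ch.encode('utf-8', errors='surrogateescape')
    except UnicodeEncodeError:
        return False
    return True

# Which of the 2048 surrogate code points can be encoded at all?
surrogates = [chr(cp) for cp in range(0xD800, 0xE000)]
ok = [ch for ch in surrogates if encodable(ch)]
first, last, count = ord(ok[0]), ord(ok[-1]), len(ok)

def sorts_alike(strings) -> bool:
    """Compare code point order of the strings with byte order of their encodings."""
    by_bytes = sorted(s.encode('utf-8', errors='surrogateescape') for s in strings)
    decoded = [b.decode('utf-8', errors='surrogateescape') for b in by_bytes]
    return sorted(strings) == decoded

words = ['z', 'ä', '€', '\U0001F600']
plain_order_kept = sorts_alike(words)                # well-formed text only
escaped_order_kept = sorts_alike(words + ['\udc80']) # one escaped byte added

print(hex(first), hex(last), count, plain_order_kept, escaped_order_kept)
```

If the assumption holds, only U+DC80 through U+DCFF are encodable, so every other lone surrogate is a str with no bytes counterpart, and adding a single escaped byte is enough to make the two sort orders disagree.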
Marko