Newbie question about text encoding

Sat Mar 7 20:45:54 EST 2015

Marko Rauhamaa wrote:

> Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
> 
>> Marko Rauhamaa wrote:
>>
>>> That said, UTF-8 does suffer badly from its not being
>>> a bijective mapping.
>>
>> Can you explain?
> 
> In Python terms, there are bytes objects b that don't satisfy:
> 
>    b.decode('utf-8').encode('utf-8') == b

Are you talking about the fact that not all byte streams are valid UTF-8?
That is, some bytes objects b raise an exception on b.decode('utf-8').
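
A minimal sketch of that point, with sample byte strings chosen purely for
illustration (they are assumptions, not taken from the thread): with the
default strict handler, a bytes object either decodes and then round-trips
exactly, or it raises UnicodeDecodeError.

```python
# Sample inputs are illustrative assumptions, not taken from the thread.
samples = [
    b"hello",                      # plain ASCII, which is also valid UTF-8
    "caf\u00e9".encode("utf-8"),   # contains a valid two-byte sequence
    b"\xff",                       # 0xFF never appears in well-formed UTF-8
    b"\xc0\xaf",                   # overlong form, rejected by the strict decoder
    b"\xe2\x82",                   # truncated three-byte sequence
]

def round_trips(b):
    """Return True if b survives decode/encode, False if decoding fails."""
    try:
        return b.decode("utf-8").encode("utf-8") == b
    except UnicodeDecodeError:
        return False

results = [round_trips(b) for b in samples]
print(results)
```

The last three samples fail only because decoding raises; none of them
decodes to a string that then encodes to different bytes.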

I don't see why that means UTF-8 "suffers badly" from this. Can you give an
example of where you would expect to take an arbitrary byte-stream, decode
it as UTF-8, and expect the results to be meaningful?

For those cases where you do wish to take an arbitrary byte stream and
round-trip it, Python (3.1 and later, see PEP 383) provides the
surrogateescape error handler for that.

py> import random
py> b = bytes([random.randint(0, 255) for _ in range(10000)])
py> s = b.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0: invalid start byte
py> s = b.decode('utf-8', errors='surrogateescape')
py> s.encode('utf-8', errors='surrogateescape') == b
True
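
As a rough sketch of what the handler does under the hood (the input bytes
here are an illustrative assumption): each undecodable byte 0xNN is mapped to
the lone surrogate code point U+DCNN, and encoding with the same handler
turns it back into the original byte. The resulting string is not ordinary
text, so encoding it with the default strict handler fails.

```python
raw = b"abc\xff\x80"   # illustrative input: ASCII followed by two invalid bytes
text = raw.decode("utf-8", errors="surrogateescape")

# Each undecodable byte 0xNN shows up as the lone surrogate U+DCNN.
escaped = [hex(ord(ch)) for ch in text[3:]]
print(escaped)

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)

# The default strict handler refuses to encode lone surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

So the round trip only holds if the same error handler is used in both
directions.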

-- 
Steven