Newbie question about text encoding

Marko Rauhamaa marko at pacujo.net
Sat Mar 7 11:25:43 EST 2015


Chris Angelico <rosuav at gmail.com>:

> On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
>>
>>> Marko Rauhamaa wrote:
>>>
>>>> That said, UTF-8 does suffer badly from its not being
>>>> a bijective mapping.
>>>
>>> Can you explain?
>>
>> In Python terms, there are bytes objects b that don't satisfy:
>>
>>    b.decode('utf-8').encode('utf-8') == b
>
> Please provide an example; that sounds like a bug. If there is any
> invalid UTF-8 stream which decodes without an error, it is actually a
> security bug, and should be fixed pronto in all affected and supported
> versions.

Here's an example:

   b = b'\x80'

Yes, decoding it raises an exception (UnicodeDecodeError). In other
words, UTF-8 is not a bijective mapping from str objects to bytes
objects: some bytes objects are not the encoding of any str.
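
A minimal sketch of the point, assuming Python 3 and the default
strict error handler. The helper name utf8_round_trips is made up for
illustration; it treats a failed decode as "does not round-trip":

```python
def utf8_round_trips(b):
    """Return True if b survives decode/encode unchanged, False otherwise."""
    try:
        return b.decode('utf-8').encode('utf-8') == b
    except UnicodeDecodeError:
        # b is not valid UTF-8, so no str corresponds to it.
        return False

print(utf8_round_trips(b'\x80'))      # False: a lone continuation byte
print(utf8_round_trips(b'\xc3\xa9'))  # True: the UTF-8 encoding of 'é'
```

Valid UTF-8 input round-trips; b'\x80' never gets as far as the
comparison because the decode step fails first.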


Marko


