Newbie question about text encoding

Sat Mar 7 11:41:28 EST 2015

On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Chris Angelico <rosuav at gmail.com>:
>
>> On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>>> Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
>>>
>>>> Marko Rauhamaa wrote:
>>>>
>>>>> That said, UTF-8 does suffer badly from its not being
>>>>> a bijective mapping.
>>>>
>>>> Can you explain?
>>>
>>> In Python terms, there are bytes objects b that don't satisfy:
>>>
>>>    b.decode('utf-8').encode('utf-8') == b
>>
>> Please provide an example; that sounds like a bug. If there is any
>> invalid UTF-8 stream which decodes without an error, it is actually a
>> security bug, and should be fixed pronto in all affected and supported
>> versions.
>
> Here's an example:
>
>    b = b'\x80'
>
> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
> from str objects to bytes objects.

That's not the same as what you said. All you've proven is that there
are bit patterns which are not UTF-8 streams... which is a very
deliberate feature. How does UTF-8 *suffer* from this? It benefits
hugely!

ChrisA