Newbie question about text encoding

Sat Mar 7 11:54:14 EST 2015

Chris Angelico <rosuav at gmail.com>:

> On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>>>>> Marko Rauhamaa wrote:
>>>>>> That said, UTF-8 does suffer badly from its not being
>>>>>> a bijective mapping.
>>>>>
>> Here's an example:
>>
>>    b'\x80'.decode('utf-8')
>>
>> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
>> from str objects to bytes objects.
>
> That's not the same as what you said.

Except that it's precisely what I said.

> All you've proven is that there are bit patterns which are not UTF-8
> streams...

And that causes problems.

> which is a very deliberate feature.

Well, nobody desired it. It was just something that had to give.

I believe you *could* have defined it as a bijective mapping but then
you would have lost the sorting order correspondence.
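
To illustrate the sorting order correspondence I mean, here is a small
sketch. The sample strings are arbitrary picks of mine, chosen only so
that 1-, 2-, 3- and 4-byte sequences all show up. Sorting by code point
and sorting by the encoded bytes should give the same order:

```python
# Arbitrary sample strings covering 1-, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["\U0001F600", "z", "\u20ac", "a", "\u00e9", "\uffff"]

# Python compares str objects code point by code point.
by_code_point = sorted(samples)

# bytes objects compare byte by byte.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

same_order = by_code_point == by_utf8_bytes
print(same_order)
```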

> How does UTF-8 *suffer* from this? It benefits hugely!

You can't operate on file names and text files using Python strings. Or
at least, you will need to add (nontrivial) exception catching logic.
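
A minimal sketch of the kind of thing I mean. The file name is made up;
any byte string that is not valid UTF-8 would do. The surrogateescape
error handler (PEP 383) is one way around the first exception, but the
resulting str then refuses to be encoded strictly, so the problem only
moves:

```python
raw = b"caf\x80.txt"  # hypothetical file name; 0x80 alone is not valid UTF-8

# Strict decoding fails, so it needs exception handling.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate code point
# and maps it back on encoding, so the round trip is lossless.
name = raw.decode("utf-8", errors="surrogateescape")
round_trip = name.encode("utf-8", errors="surrogateescape")

# The str now contains a lone surrogate, which strict UTF-8 encoding rejects.
try:
    name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

print(strict_decode_ok, round_trip == raw, strict_encode_ok)
```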

Marko