Newbie question about text encoding

Marko Rauhamaa marko at pacujo.net
Sat Mar 7 11:54:14 EST 2015


Chris Angelico <rosuav at gmail.com>:

> On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>>>>> Marko Rauhamaa wrote:
>>>>>> That said, UTF-8 does suffer badly from its not being
>>>>>> a bijective mapping.
>>>>>
>> Here's an example:
>>
>>    b'\x80'.decode('utf-8')
>>
>> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
>> from str objects to bytes objects.
>
> That's not the same as what you said.

Except that it's precisely what I said.

> All you've proven is that there are bit patterns which are not UTF-8
> streams...

And that causes problems.
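
To make the point concrete, here is a minimal sketch (the variable names are mine, purely for illustration). It shows that a lone continuation byte is not decodable, so no str object maps to it under UTF-8:

```python
# A lone continuation byte: not a valid UTF-8 sequence on its own.
data = b"\x80"

try:
    data.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # Strict decoding rejects the byte string outright.
    decoded_ok = False

# The code point U+0080 itself encodes to two bytes, not to b"\x80".
text = "\u0080"
round_trip = text.encode("utf-8")

print(decoded_ok)   # False
print(round_trip)   # b'\xc2\x80'
```

So every str (without lone surrogates) has a UTF-8 encoding, but not every bytes object is the encoding of some str.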

> which is a very deliberate feature.

Well, nobody desired it. It was just something that had to give.

I believe you *could* have defined it as a bijective mapping but then
you would have lost the correspondence between byte-wise sort order and
code point sort order.
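
A quick sketch of that sorting property, using a sample word list I made up. The UTF-16 comparison is only there as a contrast and is not something discussed above:

```python
# Sample strings spanning 1-, 2-, 3- and 4-byte UTF-8 sequences.
words = ["z", "\u00e9", "a", "\u20ac", "\U0001F600", "\uffff"]

# Python compares str objects code point by code point.
by_code_point = sorted(words)

# Sorting by the raw UTF-8 bytes gives the same order.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# For contrast: UTF-16 byte order differs once surrogate pairs appear.
by_utf16_bytes = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8_bytes)    # True
print(by_code_point == by_utf16_bytes)   # False
```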

> How does UTF-8 *suffer* from this? It benefits hugely!

You can't reliably operate on file names and text files using Python
strings, because their bytes need not be valid UTF-8. Or at least, you
will need to add (nontrivial) exception catching logic.
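
As a rough sketch of the kind of logic I mean (the file name and the helper decode_name are hypothetical, and the surrogateescape fallback is one possible workaround, not the only one):

```python
# Hypothetical file name as raw bytes; b"\x80" makes it invalid UTF-8.
raw_name = b"report-\x80.txt"

def decode_name(raw: bytes) -> str:
    """Decode a file name, smuggling undecodable bytes through as
    lone surrogates (PEP 383 'surrogateescape') when strict decoding fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("utf-8", errors="surrogateescape")

name = decode_name(raw_name)

# The original bytes are recoverable only with the same error handler.
restored = name.encode("utf-8", errors="surrogateescape")

# A strict encode of the resulting str fails, so the exception handling
# has to follow the string wherever it gets written or printed.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(restored == raw_name)  # True
print(strict_ok)             # False
```

The string you get back is not ordinary text any more: it contains a lone surrogate, and anything that encodes it strictly will blow up.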


Marko


