[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Tue Apr 28 23:02:59 CEST 2009

Glenn Linderman wrote:
> On approximately 4/28/2009 1:25 PM, came the following characters from
> the keyboard of Martin v. Löwis:
>>> The UTF-8b representation suffers from the same potential ambiguities as
>>> the PUA characters... 
>>
>> Not at all the same ambiguities. Here, again, the two choices:
>>
>> A. use PUA characters to represent undecodable bytes, in particular for
>>    UTF-8 (the PEP actually never proposed this to happen).
>>    This introduces an ambiguity: two different files in the same
>>    directory may decode to the same string name, if one has the PUA
>>    character, and the other has a non-decodable byte that gets decoded
>>    to the same PUA character.
>>
>> B. use UTF-8b, representing the byte with ill-formed surrogate codes.
>>    The same ambiguity does *NOT* exist. If a file on disk already
>>    contains an invalid surrogate code in its file name, then the UTF-8b
>>    decoder will recognize this as invalid, and decode it byte-for-byte,
>>    into three surrogate codes. Hence, the file names that are different
>>    on disk are also different in memory. No ambiguity.
> 
> C. File on disk with the invalid surrogate code, accessed via the str
> interface, no decoding happens, matches in memory the file on disk with
> the byte that translates to the same surrogate, accessed via the bytes
> interface.  Ambiguity.

Is that an alternative to A and B?
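
To make option B concrete, here is a minimal sketch. It assumes the
"surrogateescape" error handler that later CPython releases (3.1 and up)
provide for this scheme; the thread itself only calls it UTF-8b. The two
file names and the helper names decode_name/encode_name are made up for
illustration.

```python
# Two distinct on-disk names (hypothetical examples):
raw_byte_name = b"\xff"                  # a single byte that is not valid UTF-8
encoded_surrogate_name = b"\xed\xb3\xbf" # U+DCFF written out in UTF-8 form, also invalid


def decode_name(name: bytes) -> str:
    """Decode a file name, escaping each undecodable byte as a lone surrogate."""
    return name.decode("utf-8", "surrogateescape")


def encode_name(name: str) -> bytes:
    """Reverse decode_name, turning escaped surrogates back into raw bytes."""
    return name.encode("utf-8", "surrogateescape")


from_raw_byte = decode_name(raw_byte_name)
from_surrogate = decode_name(encoded_surrogate_name)

# The lone byte becomes one surrogate; the three invalid bytes become three.
print(ascii(from_raw_byte))
print(ascii(from_surrogate))

# Names that differ on disk still differ in memory, and both round-trip.
print(from_raw_byte != from_surrogate)
print(encode_name(from_raw_byte) == raw_byte_name)
print(encode_name(from_surrogate) == encoded_surrogate_name)
```

Under that assumption, the byte 0xFF comes back as the single code
U+DCFF, while the already-invalid three-byte sequence comes back as
U+DCED U+DCB3 U+DCBF, so the two names stay distinct.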

Regards,
Martin