[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
"Martin v. Löwis"
martin at v.loewis.de
Tue Apr 28 23:02:59 CEST 2009
Glenn Linderman wrote:
> On approximately 4/28/2009 1:25 PM, came the following characters from
> the keyboard of Martin v. Löwis:
>>> The UTF-8b representation suffers from the same potential ambiguities as
>>> the PUA characters...
>>
>> Not at all the same ambiguities. Here, again, the two choices:
>>
>> A. use PUA characters to represent undecodable bytes, in particular for
>> UTF-8 (the PEP actually never proposed this to happen).
>> This introduces an ambiguity: two different files in the same
>> directory may decode to the same string name, if one has the PUA
>> character, and the other has a non-decodable byte that gets decoded
>> to the same PUA character.
>>
>> B. use UTF-8b, representing the byte with ill-formed surrogate codes.
>> The same ambiguity does *NOT* exist. If a file on disk already
>> contains an invalid surrogate code in its file name, then the UTF-8b
>> decoder will recognize this as invalid, and decode it byte-for-byte,
>> into three surrogate codes. Hence, the file names that are different
>> on disk are also different in memory. No ambiguity.
>
> C. File on disk with the invalid surrogate code, accessed via the str
> interface, no decoding happens, matches in memory the file on disk with
> the byte that translates to the same surrogate, accessed via the bytes
> interface. Ambiguity.
Is that an alternative to A and B?

Regards,
Martin
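
To make option B concrete, here is a minimal sketch. It assumes a Python
version in which the UTF-8b scheme discussed in this thread is available as
the "surrogateescape" error handler; the variable names are only
illustrative. It decodes a lone undecodable byte and the ill-formed
three-byte sequence for the same surrogate, and checks that the two results
differ and round-trip.

```python
# Sketch only: assumes the "surrogateescape" error handler is available.
# Variable names are hypothetical, chosen for illustration.

# A file name consisting of one byte that is not valid UTF-8.
on_disk_a = b"\x80"
# A file name holding the ill-formed UTF-8 sequence for the surrogate U+DC80.
on_disk_b = b"\xed\xb2\x80"

name_a = on_disk_a.decode("utf-8", "surrogateescape")
name_b = on_disk_b.decode("utf-8", "surrogateescape")

# The lone byte becomes a single surrogate code.
print(ascii(name_a))
# The ill-formed sequence is decoded byte-for-byte into three surrogate codes.
print(ascii(name_b))

# Names that differ on disk also differ in memory.
print(name_a != name_b)

# Encoding with the same handler restores the original bytes.
print(name_a.encode("utf-8", "surrogateescape") == on_disk_a)
print(name_b.encode("utf-8", "surrogateescape") == on_disk_b)
```

Under that assumption, the second name has three code points rather than
one, so it cannot collide with the first.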