[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
"Martin v. Löwis"
martin at v.loewis.de
Tue Apr 28 22:25:07 CEST 2009
> The UTF-8b representation suffers from the same potential ambiguities as
> the PUA characters...
Not at all the same ambiguities. Here, again, the two choices:
A. use PUA characters to represent undecodable bytes, in particular for
UTF-8 (the PEP never actually proposed this).
This introduces an ambiguity: two different files in the same
directory may decode to the same string name, if one has the PUA
character, and the other has a non-decodable byte that gets decoded
to the same PUA character.
B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
The same ambiguity does *NOT* exist. If a file on disk already
contains an invalid surrogate code in its file name, then the UTF-8b
decoder will recognize this as invalid, and decode it byte-for-byte,
into three surrogate codes. Hence, the file names that are different
on disk are also different in memory. No ambiguity.
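The difference between the two choices can be sketched in Python 3, where the "surrogateescape" error handler plays the role of UTF-8b. This is only an illustration under stated assumptions: the PUA scheme for choice A is hypothetical (the U+F700 base, the handler name "pua_escape_demo", and the sample file names are all made up here), and it is not anything the PEP proposed.

```python
import codecs

# Choice A (hypothetical): map each undecodable byte to a PUA character.
# The base U+F700 is an assumption chosen only for this demonstration.
PUA_BASE = 0xF700

def _pua_escape(exc):
    # Error handler: replace the offending bytes with PUA characters.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end

codecs.register_error("pua_escape_demo", _pua_escape)

# Two distinct file names on disk: one holds a real, validly encoded
# PUA character, the other holds a single non-decodable byte.
name_with_pua = "\uf780".encode("utf-8")   # b'\xef\x9e\x80'
name_with_bad_byte = b"\x80"

a1 = name_with_pua.decode("utf-8", "pua_escape_demo")
a2 = name_with_bad_byte.decode("utf-8", "pua_escape_demo")
assert name_with_pua != name_with_bad_byte
assert a1 == a2                            # ambiguity: same string in memory

# Choice B: surrogate escapes. A file name that already contains an
# encoded surrogate (U+DC80 as b'\xed\xb2\x80') is invalid UTF-8, so it
# is decoded byte-for-byte into three surrogate codes.
name_with_surrogate = b"\xed\xb2\x80"

b1 = name_with_surrogate.decode("utf-8", "surrogateescape")
b2 = name_with_bad_byte.decode("utf-8", "surrogateescape")
assert len(b1) == 3 and len(b2) == 1
assert b1 != b2                            # no ambiguity

# Both names also round-trip back to their original bytes.
assert b1.encode("utf-8", "surrogateescape") == name_with_surrogate
assert b2.encode("utf-8", "surrogateescape") == name_with_bad_byte

print(ascii(a1), ascii(a2))
print(ascii(b1), ascii(b2))
```

The output goes through ascii() because strings holding lone surrogates generally cannot be written to a terminal directly.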
Regards,
Martin