[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Tue Apr 28 07:25:15 CEST 2009


On approximately 4/27/2009 8:35 PM, came the following characters from 
the keyboard of Martin v. Löwis:
> Glenn Linderman wrote:
>> On approximately 4/27/2009 12:42 PM, came the following characters from
>> the keyboard of Martin v. Löwis:
>>>>> It's a private use area. It will never carry an official character
>>>>> assignment.
>>>> I know that U+F0000 - U+FFFFF is a private use area.  I don't find a
>>>> definition of U+F01xx to know what the notation means.  Are you picking
>>>> a particular character within the private use area, or a particular
>>>> range, or what?
>>> It's a range. The lower-case 'x' denotes a variable half-byte, ranging
>>> from 0 to F. So this is the range U+F0100..U+F01FF, giving 256 code
>>> points.
>>
>> So you only need 128 code points, so there is something else unclear.
> 
> (please understand that this is history now, since the PEP has stopped
> using PUA characters).


Yes.  Having finally found the latest PEP (at least I hope the one at 
python.org is the latest; it has stopped using PUA, anyway), I can 
confirm that is history.  But the same issue applies to the range of 
half-surrogates.


> No. You seem to assume that all bytes < 128 decode successfully always.
> I believe this assumption is wrong, in general:
> 
> py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position
> 3-4: illegal multibyte sequence
> 
> All bytes are below 128, yet it fails to decode.


Indeed, that was the missing piece.  I'd forgotten about the encodings 
that use escape sequences, as opposed to UTF-8 and the DBCS encodings.  
I don't think escape-sequence encodings are permitted by POSIX file 
systems, but I suppose they could sneak in via environment variable 
values and the like.
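
For anyone trying this on 3.x, here is a rough sketch of the same 
check.  The byte string is the one from the quoted example; the 
variable names are only illustrative.  It confirms that every byte is 
below 128 and that the decode still fails:

```python
# Python 3 sketch of the quoted 2.x example: every byte is below 128,
# yet the ISO-2022-JP codec rejects the sequence.
data = b"\x1b$B' \x1b(B"

# True when no byte has its high bit set.
all_below_128 = all(b < 128 for b in data)

try:
    data.decode("iso-2022-jp")
    decode_failed = False
except UnicodeDecodeError:
    # ESC $ B switches to a double-byte set; the pair "' " that
    # follows is not a valid character in that set.
    decode_failed = True

print(all_below_128, decode_failed)
```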

The switch from PUA to half-surrogates does not resolve the issue that 
the encoding is not a 1-to-1 mapping, though.  The very fact that you 
think you can get away with using lone surrogates means that other 
people might, accidentally or intentionally, also use lone surrogates 
for some other purpose, even in file names.
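
To make the concern concrete, here is a small sketch.  It assumes the 
"surrogateescape" error handler as it exists in Python 3.1 and later, 
which this message does not name, and the variable names are 
hypothetical.  A string holding a lone surrogate that came from 
somewhere else is indistinguishable from one produced by escaping an 
undecodable byte:

```python
# Assumption: the "surrogateescape" error handler (Python 3.1+).
# It maps an undecodable byte 0xNN (0x80..0xFF) to the lone
# surrogate U+DCNN when decoding, and back again when encoding.
undecodable = b"\xff"
from_bytes = undecodable.decode("utf-8", "surrogateescape")

# A lone surrogate that some other program used for its own purpose.
from_elsewhere = "\udcff"

# The two strings compare equal, so the origin cannot be recovered.
collide = from_bytes == from_elsewhere

# Encoding the unrelated string yields the raw byte 0xFF.
encoded = from_elsewhere.encode("utf-8", "surrogateescape")

print(collide, encoded)
```

Whether this matters in practice depends on how often lone surrogates 
turn up from other sources; the sketch only shows that the two cases 
cannot be told apart once they are strings.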


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

