[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Thu Apr 30 09:58:16 CEST 2009


On approximately 4/29/2009 7:50 PM, came the following characters from 
the keyboard of Aahz:
> On Thu, Apr 30, 2009, Cameron Simpson wrote:
>> The lengthy discussion mostly revolves around:
>>
>>   - Glenn points out that strings that came _not_ from listdir, and that are
>>     _not_ well-formed unicode (== "have bare surrogates in them") but that
>>     were intended for use as filenames will conflict with the PEP's scheme -
>>     programs must know that these strings came from outside and must be
>>     translated into the PEP's funny-encoding before use in the os.*
>>     functions. Previous to the PEP they would get used directly and
>>     encode differently after the PEP, thus producing different POSIX
>>     filenames. Breakage.
>>
>>   - Glenn would like the encoding to use Unicode scalar values only,
>>     using a rare-in-filenames character.
>>     That would avoid the issue with "outside" strings that contain
>>     surrogates. To my mind it just moves the punning from rare illegal
>>     strings to merely uncommon but legal characters.
>>
>>   - Some parties think it would be better to not return strings from
>>     os.listdir but a subclass of string (or at least a duck-type of
>>     string) that knows where it came from and is also handily
>>     recognisable as not-really-a-string for purposes of deciding
>>     whether it is PEP-funny-encoded by direct inspection.
> 
> Assuming people agree that this is an accurate summary, it should be
> incorporated into the PEP.

I'll agree that, once other misconceptions were explained away, the 
remaining issues are those Cameron summarized.  Thanks for the summary!
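
For concreteness, here is a minimal sketch of the round trip the PEP 
proposes (assuming the error handler ends up exposed to Python code as 
"surrogateescape"; the byte string is made up for the example):

  # Each undecodable byte 0xXY becomes the lone surrogate U+DCXY when
  # decoding, and encoding with the same handler restores the byte.
  raw = b"caf\xe9"                                # not valid UTF-8
  name = raw.decode("utf-8", "surrogateescape")   # 'caf\udce9'
  assert name.encode("utf-8", "surrogateescape") == raw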

Point two could be modified because I've changed my opinion.  I like 
the invariant Cameron first (I think) explicitly stated about the PEP 
as it stands, and that I just reworded in another message: the strings 
that are altered by the PEP, in either direction, are exactly the 
subset of strings that contain fake (from a strict Unicode viewpoint) 
characters.  I still think an encoding that uses mostly real 
characters with assigned glyphs would be better than the encoding in 
the PEP, but I would now suggest that the escape character be a fake 
character.
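
To make that invariant concrete: a string is touched by the PEP's 
mapping, in either direction, only if it contains one of the fake 
characters.  A quick check might look like this (the helper is my own 
illustration, not anything defined by the PEP):

  # The PEP reserves lone surrogates U+DC80..U+DCFF for escaped bytes;
  # well-formed Unicode text never contains them.
  def contains_fake(s):
      return any('\udc80' <= ch <= '\udcff' for ch in s)

  contains_fake('caf\udce9')   # True  -- produced by the PEP's decoding
  contains_fake('caf\xe9')     # False -- untouched in either direction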

I'll note here that while the PEP encoding translates each illegal 
byte to one fake character, a 3-byte sequence that happens to look 
like one of the fake characters would also be translated, into a 
sequence of 3 fake characters.  That is 512 combinations that must be 
translated, and understood by the user (or at least by the 
programmer).  The "escape sequence" approach requires changing only 
257 combinations, and each altered combination would result in exactly 
2 characters.  Hence it seems simpler to understand, and to manually 
encode and decode for debugging purposes.
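
As a rough illustration of that two-character expansion (a toy sketch 
of my own, not code from the PEP; ESC and the helper name are invented 
for the example), the escape-style form could be derived from the 
PEP's form like this:

  ESC = '\udc00'   # per the suggestion above, the escape is itself a fake character

  def escape_style(raw, encoding='utf-8'):
      # Decode with the PEP's mapping first, then rewrite each fake
      # character as ESC plus the real character for that byte value,
      # doubling any literal ESC so decoding stays unambiguous.
      s = raw.decode(encoding, 'surrogateescape')
      out = []
      for ch in s:
          if ch == ESC:
              out.append(ESC + ESC)                    # escaped escape
          elif '\udc80' <= ch <= '\udcff':
              out.append(ESC + chr(ord(ch) - 0xDC00))  # exactly 2 characters
          else:
              out.append(ch)
      return ''.join(out)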

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

