[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Wed Apr 29 23:09:26 CEST 2009


On approximately 4/29/2009 1:28 PM, came the following characters from 
the keyboard of Martin v. Löwis:
>>>>>>>> C. File on disk with the invalid surrogate code, accessed via the
>>>>>>>> str interface, no decoding happens, matches in memory the file on disk
>>>>>>>> with the byte that translates to the same surrogate, accessed via the
>>>>>>>> bytes interface.  Ambiguity.
>>> What does that mean? Which specific interface are you referring to for
>>> obtaining file names? 
>> os.listdir("")
>>
>> os.listdir(b"")
>>
>> So I guess I'd better suggest that a specific, equivalent directory name
>> be passed in either bytes or str form.
> 
> [Leaving the issue of the empty string apparently having different
> meanings aside ...]
> 
> Ok. Now I understand the example. So you do
> 
> os.listdir("c:/tmp")
> os.listdir(b"c:/tmp")
> 
> and you have a file in c:/tmp that is named "abc\uDC10".
> 
>> So what you are saying here is that Python doesn't use the "A" forms of
>> the Windows APIs for filenames, but only the "W" forms, and uses lossy
>> decoding (from MS) to the current code page (which can never be UTF-8 on
>> Windows).
> 
> Actually, it does use the A form, in the second listdir example. This,
> in turn (inside Windows), uses the lossy CP_ACP encoding. You get back
> a byte string; the listdirs should give
> 
> ["abc\uDC10"]
> [b"abc?"]
> 
> (not quite sure about the second - I only guess that CP_ACP will replace
> the half surrogate with a question mark).
> 
> So where is the ambiguity here?

None.  But not everyone can read all the Python source code to try to 
understand it; they expect the documentation to help them avoid that. 
Because the documentation is lacking in this area, your concisely 
stated PEP is rather hard to understand.
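
For anyone else following along, here is a minimal sketch of the lossy 
conversion described above.  It is an illustration under assumptions, 
not what Windows actually runs: cp1252 merely stands in for CP_ACP, and 
Python's "replace" error handler stands in for whatever substitution the 
"A" API performs, so the real result may differ.

```python
# Sketch only: imitates the lossy "A"-API conversion with an ordinary codec.
# Assumption: cp1252 stands in for CP_ACP; real Windows behavior may differ.
name = "abc\udc10"  # file name ending in a lone low surrogate

# A strict encode refuses the lone surrogate outright.
try:
    name.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lossy encode substitutes "?" for the code point it cannot represent.
lossy = name.encode("cp1252", errors="replace")

# The substitution cannot be undone: decoding yields a literal "?".
round_trip = lossy.decode("cp1252")

print(strict_ok, lossy, round_trip == name)
```

The point of the sketch is that the bytes form loses information, so the 
str result and the bytes result name the file differently rather than 
colliding.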

Thanks for clarifying the Windows behavior here.  A little more 
clarification in the PEP could have avoided lots of discussion.  It 
would seem that a PEP proposing to modify a poorly documented (and 
therefore likely poorly understood) area should explain the status 
quo as well as present the suggested change.  Or is it 
the Python philosophy that the PEPs should be as incomprehensible as 
possible, to generate large discussions?


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

