[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

"Martin v. Löwis" martin at v.loewis.de
Wed Apr 29 22:28:54 CEST 2009


>>>>>>> C. File on disk with the invalid surrogate code, accessed via the
>>>>>>> str interface, no decoding happens, matches in memory the file on disk
>>>>>>> with the byte that translates to the same surrogate, accessed via the
>>>>>>> bytes interface.  Ambiguity.
>> What does that mean? What specific interface are you referring to in
>> order to obtain file names?
> 
> os.listdir("")
> 
> os.listdir(b"")
> 
> So I guess I'd better suggest that a specific, equivalent directory name
> be passed in either bytes or str form.

[Leaving the issue of the empty string apparently having different
meanings aside ...]

Ok. Now I understand the example. So you do

os.listdir("c:/tmp")
os.listdir(b"c:/tmp")

and you have a file in c:/tmp that is named "abc\uDC10".

> So what you are saying here is that Python doesn't use the "A" forms of
> the Windows APIs for filenames, but only the "W" forms, and uses lossy
> decoding (from MS) to the current code page (which can never be UTF-8 on
> Windows).

Actually, it does use the A form, in the second listdir example. This,
in turn (inside Windows), uses the lossy CP_ACP encoding. You get back
a byte string; the two listdir calls should give

["abc\uDC10"]
[b"abc?"]

(not quite sure about the second - I only guess that CP_ACP will replace
the half surrogate with a question mark).
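
To make the two calls concrete, here is a minimal sketch. It uses a
temporary directory as a stand-in for c:/tmp and an ordinary file name,
since creating a file named "abc\uDC10" depends on the platform. The
cp1252 codec with errors="replace" is only an assumption standing in for
whatever CP_ACP really does; it is not a call into the Windows API.

```python
import os
import tempfile

# Hypothetical stand-in for c:/tmp, holding one plainly named file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "abc"), "w"):
        pass
    str_names = os.listdir(tmp)                  # str argument -> list of str
    bytes_names = os.listdir(os.fsencode(tmp))   # bytes argument -> list of bytes

# Assumption: a lossy code-page conversion behaves roughly like an
# encode with errors="replace", which turns the lone surrogate into "?".
lossy = "abc\udc10".encode("cp1252", errors="replace")

print(str_names, bytes_names, lossy)
```

The point is only that the result type follows the argument type, and
that a lossy conversion cannot be reversed to recover the surrogate.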

So where is the ambiguity here?

> You are further saying that Python doesn't give the programmer control
> over the codec that is used to convert from W results to bytes, so that
> on Windows, it is impossible to obtain a bytes result containing UTF-8
> from os.listdir, even though sys.setfilesystemencoding exists, and
> sys.getfilesystemencoding is affected by it, and the latter is
> documented as returning "mbcs", and as returning the codec that should
> be used by the application to convert str to bytes for filenames.
> (Python 3.0.1).

Not exactly. You *can* call sys.setfilesystemencoding on Windows, but it
has no effect, as the Python file system encoding is never used on
Windows. For a string, Python passes it to the W API as-is; for bytes, it
passes them to the A API as-is. Python never invokes any codec here.
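
The dispatch just described can be sketched roughly as follows. The
function name windows_api_variant is hypothetical and exists only for
illustration; it is not how the C implementation is organised, and it
merely names the API family that would be chosen for a given argument.

```python
def windows_api_variant(path):
    """Hypothetical helper: name the Windows API family used for a path."""
    if isinstance(path, str):
        return "W"  # wide-character API; the string is passed through as-is
    if isinstance(path, bytes):
        return "A"  # ANSI API; the bytes are passed through as-is
    raise TypeError("path must be str or bytes")


print(windows_api_variant("c:/tmp"), windows_api_variant(b"c:/tmp"))
```

Neither branch involves a Python codec, which is why the file system
encoding setting plays no role in this picture.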

> While I can hear a "that is outside the scope of the PEP" coming, this
> documentation is confusing, to say the least.

Only because you are apparently unaware of the status quo. If you
studied the current Python source code, it would all be very clear.

> Things are a little clearer in the documentation for
> sys.setfilesystemencoding, which does say the encoding isn't used by
> Windows -- so why is it permitted to change it, if it has no effect?

As in many cases: because nobody contributed code to make it behave
otherwise. It's not that the file system encoding is "mbcs" - the
file system encoding is simply unused on Windows (but that wasn't
always the case, in particular not when Windows 9x still had to
be supported).

Regards,
Martin


