[Python-Dev] Windows: Remove support of bytes filenames in the os module?

Mon Feb 8 20:57:15 EST 2016

All I can say is "ouch". Hard to call it a regression to no longer
allow this mess...

CHB

> On Feb 8, 2016, at 4:37 PM, eryk sun <eryksun at gmail.com> wrote:
>
>> On Mon, Feb 8, 2016 at 2:41 PM, Chris Barker <chris.barker at noaa.gov> wrote:
>> Just to clarify -- what does it currently do for bytes? IIUC, Windows uses
>> UTF-16, so can you pass in UTF-16 bytes? Or when using bytes is it assuming
>> some Windows ANSI-compatible encoding? (and what does it return?)
>
> UTF-16 is used in the [W]ide-character API. Bytes paths use the [A]NSI
> codepage. For a single-byte codepage, the ANSI API roundtrips, i.e. a
> bytes path that's passed to CreateFileA matches the listing from
> FindFirstFileA. But for a DBCS codepage arbitrary bytes paths do not
> roundtrip. Invalid byte sequences map to the default character. Note
> that an ASCII question mark is not always the default character. It
> depends on the codepage.
>
> For example, in codepage 932 (Japanese), it's an error if a lead byte
> (i.e. 0x81-0x9F, 0xE0-0xFC) is followed by a trailing byte with a
> value less than 0x40 (note that ASCII 0-9 is 0x30-0x39, so this is not
> uncommon). In this case the ANSI API substitutes the default character
> for Japanese, '・' (U+30FB, Katakana middle dot).
>
>    >>> import locale, os
>    >>> locale.getpreferredencoding()
>    'cp932'
>    >>> open(b'\xe05', 'w').close()
>    >>> os.listdir('.')
>    ['・']
>    >>> os.listdir(b'.')
>    [b'\x81E']
>
> All invalid sequences get mapped to '・', which roundtrips as
> b'\x81\x45', so you can't reliably create and open files with
> arbitrary bytes paths in this locale.
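
The roundtrip problem described above can be approximated off Windows
with a rough sketch like the one below. The function names are
hypothetical, and Python's own cp932 codec is only a stand-in for the
conversion the ANSI API performs: it raises an error on an invalid
sequence instead of substituting the default character, so this shows
which bytes paths are at risk rather than reproducing what Windows does.

```python
def roundtrips_in_codepage(name: bytes, codepage: str = "cp932") -> bool:
    """Return True if `name` survives a strict decode/encode cycle.

    Assumption: Python's codec approximates the codepage conversion.
    The Windows ANSI API would not fail on bad input; it would
    substitute the codepage's default character.
    """
    try:
        return name.decode(codepage).encode(codepage) == name
    except UnicodeError:
        return False


def has_invalid_cp932_trail(name: bytes) -> bool:
    """Detect the specific error case described in the message:
    a lead byte (0x81-0x9F, 0xE0-0xFC) followed by a trail byte < 0x40.

    A dangling lead byte at the end of the name is not flagged here;
    the message does not say how that case is handled.
    """
    i = 0
    while i < len(name):
        b = name[i]
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            if i + 1 < len(name) and name[i + 1] < 0x40:
                return True
            i += 2  # skip the trail byte of a double-byte character
        else:
            i += 1  # single-byte character
    return False


if __name__ == "__main__":
    for candidate in (b"\xe05", b"\x81E", "テスト".encode("cp932")):
        print(candidate,
              roundtrips_in_codepage(candidate),
              has_invalid_cp932_trail(candidate))
```

Under this approximation b'\xe05' is flagged as not roundtripping,
while b'\x81E' (the encoding of '・') is fine, which is consistent with
the listing shown in the quoted session.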