[Python-ideas] Fix default encodings on Windows

Steve Dower steve.dower at python.org
Wed Aug 10 15:39:19 EDT 2016


On 10Aug2016 1226, Random832 wrote:
> On Wed, Aug 10, 2016, at 15:08, Steve Dower wrote:
>> Testing with obscure filenames and strings is where help will be needed
>> most :)
>
> How about filenames with invalid surrogates? For added fun, consider
> that the file system encoding is normally used with surrogateescape.

This is where it gets extra fun: surrogateescape is not normally used
on Windows, because we receive paths as Unicode text and pass them back
as Unicode text without ever encoding or decoding them.

Currently a broken filename (such as '\udee1.txt') can be correctly seen 
with os.listdir('.') but not os.listdir(b'.') (because Windows will 
return it as '?.txt'). It can be passed to open(), but encoding the name 
to utf-8 or utf-16 fails, and I doubt there's any encoding that is going 
to succeed.
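
A minimal sketch of the encoding failure described above, assuming the
default 'strict' error handler. The helper name can_encode is made up
for illustration; nothing here touches the file system.

```python
# The broken filename from the paragraph above: a lone low surrogate.
name = '\udee1.txt'

def can_encode(text, encoding):
    """Return True if text encodes with the default 'strict' handler.

    Hypothetical helper, used only for this illustration.
    """
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# Both codecs reject the unpaired surrogate under strict error handling.
for enc in ('utf-8', 'utf-16'):
    print(enc, can_encode(name, enc))
```

Both lines should report False, while an ordinary name such as
'a.txt' encodes without trouble.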

As far as I can tell, if you get a weird name in bytes today, your code
is already broken, and the only way to be unbroken is to do the actual
right thing: convert paths on POSIX into Unicode with surrogateescape.
So our official advice has to stay the same - treating
paths as text with smuggled bytes is the *only* way to be truly correct. 
But unless we also deprecate byte paths on POSIX, we'll never get there. 
(Now there's a dangerous idea ;) )
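
A rough sketch of what "text with smuggled bytes" means in practice,
assuming a UTF-8 file system encoding. The names raw, to_text and
to_bytes are hypothetical; on POSIX, os.fsdecode() and os.fsencode()
do the same job using the real file system encoding.

```python
# Hypothetical POSIX filename; the byte 0xff is never valid in UTF-8.
raw = b'\xff.txt'

def to_text(path_bytes, encoding='utf-8'):
    """Decode bytes to str, smuggling undecodable bytes as lone surrogates."""
    return path_bytes.decode(encoding, 'surrogateescape')

def to_bytes(path_text, encoding='utf-8'):
    """Encode str back to bytes, restoring any smuggled bytes."""
    return path_text.encode(encoding, 'surrogateescape')

text = to_text(raw)
print(ascii(text))             # '\udcff.txt'
print(to_bytes(text) == raw)   # True: the round trip is lossless
```

The undecodable byte 0xff travels as the lone surrogate U+DCFF, so the
path can be handled as str and still map back to the original bytes.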

Cheers,
Steve


