[Python-Dev] Windows: Remove support of bytes filenames in the os module?

Steve Dower python at stevedower.id.au
Tue Feb 9 23:40:17 EST 2016


On 09Feb2016 2017, Stephen J. Turnbull wrote:
>   > The problem here is the protocol that Python uses to return bytes paths,
>   > and that protocol is inconsistent between APIs and information is lost.
>
> No, the problem is that the necessary information simply isn't always
> available.  Not even today: think removable media, especially archival
> content.  Also network file systems: I don't know if it still happens,
> but I've seen Shift JIS, GB2312, and KOI8-R all in the same directory,
> and sometimes two of those in the *same path*.  (Don't ask me how
> non-malicious users managed to do the latter!)

But if we return bytes paths and the user passes them back in unchanged, 
the missing encoding information should be irrelevant. The earlier issue 
was that this round trip doesn't work (e.g. a bytes path from os.scandir 
couldn't be passed back into open()).
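
To make the round trip concrete, here is a minimal sketch. The helper 
name roundtrip_bytes_names is made up for illustration, and the file name 
is deliberately ASCII so the sketch does not depend on any particular 
filesystem encoding. Whether the same code succeeds for arbitrary names 
on a given platform and Python version is exactly the point under 
discussion.

```python
import os
import tempfile

def roundtrip_bytes_names(directory):
    """List a directory via a bytes path and reopen each file using
    the bytes path that os.scandir handed back, unchanged."""
    contents = {}
    # Passing bytes in means entry.name and entry.path come back as bytes.
    for entry in os.scandir(os.fsencode(directory)):
        if entry.is_file():
            # The path is treated as an opaque blob: no decode, no re-encode.
            with open(entry.path, "rb") as f:
                contents[entry.name] = f.read()
    return contents

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "example.txt"), "wb") as f:
        f.write(b"payload")
    found = roundtrip_bytes_names(tmp)
```

The important property is that user code never interprets the bytes; it 
only hands them back to the same runtime that produced them.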

>   > It really requires going through all the OS calls and either (a) making
>   > them consistently decode bytes to str using the declared FS encoding
>   > (currently 'mbcs', but I see no reason we can't make it 'utf_8'),
>
> If it were that easy, it would have been done two decades ago.  I'm no
> fan of Windows[1], but it's obvious that Microsoft has devoted
> enormous amounts of brainpower to the problem of encoding
> rationalization since the early 90s.  I don't think they would have
> missed this idea.

I meant with Python's calls into the API. Anywhere Python does the 
conversion from bytes to LPCWSTR (the UTF-16 type) there's a chance 
it'll be wrong.
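
As a rough illustration of how that conversion can go wrong, the sketch 
below decodes the same bytes under three encodings. The explicit code 
pages (cp1252, cp1251) stand in for whatever the active code page 
happens to be; they and the helper try_decode are assumptions made for 
the example, not what Python's os module does internally.

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are not valid
    in the given encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# Bytes for "café" as a cp1252 system would produce them: b'caf\xe9'
raw = "caf\u00e9".encode("cp1252")

as_cp1252 = try_decode(raw, "cp1252")  # the guess that matches the producer
as_cp1251 = try_decode(raw, "cp1251")  # a different code page: silently wrong
as_utf8 = try_decode(raw, "utf-8")     # not valid UTF-8 at all

# What a wide-character (UTF-16) API would receive after the "right" guess.
wide = as_cp1252.encode("utf-16-le")
```

Only the first guess recovers the original name. The second yields a 
different, perfectly valid-looking string, and the third fails outright.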

Your earlier comments (regarding encoding/decoding to/from Unicode, 
which I didn't have anything valuable to add to) basically reflect the 
fact that developers need to treat bytes paths as blobs on all platforms 
and the core Python runtime needs to obtain and use them consistently, 
which means *always* using the Win32 *A APIs and never doing a 
conversion ourselves.
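
A small sketch of why the "blob" treatment matters: once a conversion 
replaces bytes it cannot decode, the original name is gone. The example 
contrasts a lossy decode (the 'replace' error handler) with a lossless 
one (the 'surrogateescape' handler that Python uses for filenames on 
POSIX). The choice of UTF-8 and the sample name are assumptions for 
illustration only.

```python
raw = b"caf\xe9.txt"  # not valid UTF-8: 0xE9 is a lone lead byte

# Lossy: the undecodable byte becomes U+FFFD and cannot be recovered.
via_replace = raw.decode("utf-8", "replace")
lossy_roundtrip = via_replace.encode("utf-8")

# Lossless: the undecodable byte is smuggled through as a lone surrogate
# and restored on the way back out.
via_escape = raw.decode("utf-8", "surrogateescape")
lossless_roundtrip = via_escape.encode("utf-8", "surrogateescape")
```

A name that has been through the lossy path no longer refers to the file 
it came from, which is why the runtime has to avoid converting at all, or 
convert in a way that is guaranteed reversible.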

Microsoft's solution here is the user's active code page, much like 
*nix's solution as I understand it, except that where *nix will convert 
_to_ the encoding as a normalized form, Windows will convert _from_ the 
encoding to its UTF-16 "normalized" form. Back-compat concerns have 
prevented any significant changes being made here, otherwise there 
wouldn't be a 'bytes' interface at all. (Or more likely, everything 
would be UTF-8 based, but back-compat is king in Windows-land.)

Cheers,
Steve

