[Python-Dev] Windows: Remove support of bytes filenames in the os module?

Wed Feb 10 15:30:33 EST 2016

On Wednesday, February 10, 2016 6:50 AM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Andrew Barnert via Python-Dev writes:
> 
>>  That doesn't mean the problem can't be solved. Apple solved their
>>  equivalent problem, albeit by sacrificing backward compatibility in
>>  a way Microsoft can't get away with. I haven't seen a MacRoman or
>>  Shift-JIS filename since they broke the last holdout
> 
> If you lived where I do, you'd still be seeing both, because you
> wouldn't be able to escape archival files on CD and removable media
> (typically written on Windows boxen). They still work, sort of ==
> same as always, and as far as I know, that's because Apple has *not*
> sacrificed backward compatibility: under the hood, Darwin is still a
> POSIX kernel which thinks of file names and everything else outside of
> memory as bytestreams.

Sure, but the Darwin kernel can't read CDs; that's up to the CD filesystem driver.

Anyway, Windows CDs can't cause this problem. Windows CDs use the Joliet filesystem,[^1] which stores everything in UCS2.[^2] When you call CreateFileA or fopen or _open with bytes, Windows decodes those bytes and stores them as UCS2. The filesystem drivers on POSIX platforms have to encode that UCS2 to _something_ (POSIX APIs make it very hard for you to deal with filename strings like "A\0B\0C\0.\0T\0X\0T\0\0\0"...). The linux driver uses a mount option to decide how to encode; the OS X driver always uses UTF-8. And every valid UCS2 string can be encoded as UTF-8, so you can use unicode everywhere, even in Python 2.

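To make the NUL problem concrete, here is a minimal Python 3 sketch. The filename and the little-endian byte order are illustrative assumptions chosen to match the byte pattern quoted above, not a dump of a real Joliet directory record:

```python
# Hypothetical filename; UTF-16-LE is assumed only to match the bytes quoted above.
name = "ABC.TXT"

utf16 = name.encode("utf-16-le")
utf8 = name.encode("utf-8")

# Every ASCII character drags a NUL byte along in UTF-16, so a
# NUL-terminated C string would be cut off after the first character.
print(utf16)                    # b'A\x00B\x00C\x00.\x00T\x00X\x00T\x00'
print(utf16.split(b"\0")[0])    # b'A'

# UTF-8 never uses a NUL byte for anything but U+0000, so the same
# name passes through char*-based POSIX APIs intact.
print(utf8)                     # b'ABC.TXT'
```
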
Of course you can have mojibake problems, but that's a different issue,[^3] and no worse with unicode than with bytes.[^4]

The same thing is true with NTFS external drives, VFAT USB drives, etc. Generally, it's not Windows media on *nix systems that break Python 2 unicode; it's native *nix filesystems where users mix locales.

> One place they *fail very badly* is Shift JIS filenames in zipfiles,
> which nothing provided by Apple can deal with safely, and InfoZip
> breaks too (at least in MacPorts). Yes, I know that is specifically
> disallowed. Feel free to tell 1_0000_0000 Japanese Windows users.

The good news is, as far as I can tell, it's not disallowed anymore.[^5] So we just have to tell them that they shouldn't have been doing it in the past. :)

Anyway, zipfiles are data files as far as the OS is concerned; the fact that they contain filenames is no more relevant to the kernel (or filesystem driver or userland) than the fact that "List of PDFs to Read This Weekend.txt" contains filenames.

PS, everything Apple provides is already using Info-ZIP.

>>  So Python 2 works great on Macs, whether you use bytes or
>>  unicode. But that doesn't help us on Windows, where you can't use
>>  bytes, or Linux, where you can't use Unicode (without surrogate
>>  escape or some other mechanism that Python 2 doesn't have).
> 
> You contradict yourself! ;-)

Yes, as I later realized, sometimes, you _can_ (or at least ought to be able to--I haven't actually tried) use Python 2 with unicode everywhere to write cross-platform software that actually works on linux, by using backports of surrogate-escape and pathlib, and the io module instead of the file type, as long as you only need stdlib and third-party modules that support unicode filenames. If that does work for at least some apps, then I'm perfectly happy to have been wrong earlier. And if catching myself before someone else did makes me a flip-flopper, well, I'm not running for president. :P
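
For reference, this is roughly what the surrogate-escape mechanism does, shown with Python 3's built-in error handler. It is only a sketch: the filename bytes are made up, and a Python 2 backport may spell things differently.

```python
# Hypothetical Latin-1 filename bytes, which are not valid UTF-8.
raw = b"caf\xe9.txt"

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))   # 'caf\udce9.txt'

# Encoding with the same handler restores the original bytes exactly.
assert name.encode("utf-8", errors="surrogateescape") == raw
```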

  [^1]: Except when Vista and 7 mistakenly think your CD is a DVD and use UDF instead of ISO9660--but in that case, the encoding is stored in the filesystem header, so it's also not a problem.

  [^2]: Actually, despite Microsoft's spec, later versions of Windows store UTF-16, even if there are surrogate pairs, or BMP-but-post-UCS2 code points. But that doesn't matter here; the linux, Mac, etc. drivers all assume UTF-16, which works either way.

  [^3]: Say you write a program that assumes it will only be run on Shift-JIS systems, and you use CreateFileA to create a file named "ハローワールド". The actual bytes you're sending are cp437 for "ânâìü[âÅü[âïâh", so the file on the CD is named, in Unicode, "ânâìü[âÅü[âïâh". So of course the Mac driver encodes that to UTF-8 b"ânâìü[âÅü[âïâh". You won't have any problems opening what you readdir, or what you copy from a UTF-8 terminal or a UTF-16 Cocoa app like Finder, etc. But of course you will have trouble getting your user to recognize that name as meaningful, unless you can figure out or guess or prompt the user to guess that it needs to be passed through s.encode('cp437').decode('shift-jis').
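
The round trip in this note can be checked with a short Python 3 sketch. It assumes the misapplied code page is the one Python calls `cp437`, which is the codec that reproduces the garbled string quoted above:

```python
original = "ハローワールド"

# What CreateFileA receives on a Shift-JIS system ...
sent = original.encode("shift_jis")

# ... and how those same bytes read when treated as cp437 instead.
garbled = sent.decode("cp437")
print(garbled)   # ânâìü[âÅü[âïâh

# Reversing the two steps recovers the intended name.
repaired = garbled.encode("cp437").decode("shift_jis")
assert repaired == original
```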

  [^4]: Your locale is always UTF-8 on Mac. So the only significant difference is that if you're using bytes, you need b.decode('utf-8').encode('cp437').decode('shift-jis') to fix the problem.
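
The bytes variant looks like this as a Python 3 sketch. `repair_utf8_name` is a hypothetical helper, not anything from the stdlib, and its default codecs are assumptions that fit only this particular example:

```python
def repair_utf8_name(b, wrong="cp437", intended="shift_jis"):
    """Hypothetical helper: undo one layer of mojibake in a UTF-8 filename."""
    return b.decode("utf-8").encode(wrong).decode(intended)

# The UTF-8 bytes a readdir-style call would hand back for the garbled name.
listed = "ânâìü[âÅü[âïâh".encode("utf-8")
print(repair_utf8_name(listed))   # ハローワールド
```

The only extra work compared with the unicode case is the initial `decode('utf-8')`; guessing which pair of codecs to use is equally hard either way.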

  [^5]: Zipfiles using the Unicode extension can store a UTF-8 transcoding along with the local bytes, in which case the local bytes do not have to be in the header-declared encoding, because they will be ignored. And I think everything Microsoft ships now handles this properly. And Info-ZIP, and therefore all of Apple's tools, also handle it properly--so, not only is it legal, it even works.