[Python-Dev] When should pathlib stop being provisional?

Wed Apr 6 02:25:05 EDT 2016

On Wed, Apr 6, 2016 at 3:37 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Chris Angelico writes:
>
>  > Outside of deliberate tests, we don't create files on our disks
>  > whose names are strings of random bytes;
>
> Wishful thinking.  First, names made of control characters have often
> been deliberately used by miscreants to conceal their warez.  Second,
> in some systems it's all too easy to create paths with components in
> different locales (the place I've seen it most frequently is in NFS
> mounts).  I think that's much less true today, but perhaps that's only
> because my employer figured out that it was much less pain if system
> paths were pure ASCII so that it mostly didn't matter what encoding
> users chose for their subtrees.

Control characters are still characters, though. You can take a
bytestring consisting of byte values less than 32, decode it as UTF-8,
and have a series of codepoints to work with.
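
As a rough sketch of that point (the byte values below are made up purely
for illustration, not taken from any real file name):

```python
# Hypothetical file name consisting only of control bytes (all values < 32).
raw = bytes([1, 2, 27, 31])

# Byte values below 128 are ASCII, and ASCII is a subset of UTF-8,
# so this decodes without error.
text = raw.decode("utf-8")

# Each byte becomes one code point with the same numeric value.
codepoints = [ord(ch) for ch in text]
print(codepoints)

# The round trip back to bytes is lossless.
roundtrip = text.encode("utf-8")
```

The name is ugly to display, but it is still text that can be sliced,
compared and re-encoded like any other string.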

If your employer has "solved" the problem by restricting system paths
to ASCII, that's a fine solution for a single system with a single
ASCII-compatible encoding; a better solution is to mandate UTF-8 as
the file system encoding, as that's what most people are expecting
anyway.

> It remains important to be able to handle nearly arbitrary bytestrings
> in file names as far as I can see.  Please note that 100 million
> Japanese and 1 billion Chinese by and large still prefer their
> homegrown encodings (plural!!) to Unicode, while many systems are now
> defaulting filenames to UTF-8.  There's plenty of room remaining for
> copying bytestrings to arguments of open and friends.

Why exactly do they prefer these other encodings? Are they
representing characters that Unicode doesn't contain? If so, we have a
fundamental problem (no Python program is going to be able to cope
with these without a third-party library or some stupid mess of local
code); if not, you can always represent the name as Unicode and encode it
as UTF-8 when it reaches the file system. Re-encoding is something that's
easy when you treat something as text, and impossible when you treat
it as bytes.
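
A minimal sketch of what I mean by re-encoding, assuming the source
encoding is known to be Shift_JIS (the file name here is a hypothetical
example):

```python
# Hypothetical name as it might arrive from a system using Shift_JIS.
name = "日本語.txt"
legacy_bytes = name.encode("shift_jis")

# Treated as text: decode with the known source encoding,
# then encode as UTF-8 for the target file system.
as_text = legacy_bytes.decode("shift_jis")
utf8_bytes = as_text.encode("utf-8")

# Same characters, different byte sequences.
same_text = utf8_bytes.decode("utf-8") == name
same_bytes = utf8_bytes == legacy_bytes
```

If all you have is `legacy_bytes` and no idea which encoding produced it,
there is no equivalent of the middle step.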

So far, you're still actually agreeing with me: paths are *text*, but
sometimes we don't know the encoding (and that's a problem to be
solved).
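
One existing approach to the unknown-encoding case in Python 3 is the
"surrogateescape" error handler from PEP 383. The sketch below only
illustrates that handler; the Latin-1 sample name is an assumption chosen
for the example:

```python
# Hypothetical name written by a Latin-1 system; not valid UTF-8.
raw = b"caf\xe9.txt"

# Decode as UTF-8 anyway; the undecodable byte 0xE9 is smuggled through
# as the lone surrogate U+DCE9 instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")

# Once the real encoding is learned, the original bytes decode properly.
proper = restored.decode("latin-1")
```

The path stays a str throughout, and the original bytes are recoverable
whenever they need to be handed back to the OS.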

ChrisA