[Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

Tue Sep 30 00:17:20 CEST 2008

On Mon, Sep 29, 2008 at 11:06 AM, Guido van Rossum <guido at python.org> wrote:
> On Mon, Sep 29, 2008 at 9:45 AM, Georg Brandl <g.brandl at gmx.net> wrote:
>
>> This approach (changing all path-handling functions to accept either bytes
>> or string, but not both) is doomed in my eyes. First, there are lots of them,
>> second, they are not only in os.path but in many modules and also in user
>> code, and third, I see no clean way of implementing them in the specified way.
>> (Just try to do it with os.path.join as an example; I couldn't find the
>> good way to write it, only the bad and the ugly...)
>
> It doesn't have to be supported for all operations -- just enough to
> be able to access all the system calls, and do the most basic pathname
> manipulations (split and join -- almost everything else can be built
> out of those).
>
>> If I had to choose, I'd still argue for the modified UTF-8 as filesystem
>> encoding (if it were UTF-8 otherwise), despite possible surprises when a
>> such-encoded filename escapes from Python.
>
> I'm having a hard time finding info about UTF-8b. Does anyone have a
> decent link?

http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

Scroll down to item D, near the bottom.

It turns malformed bytes into lone (therefore malformed) surrogates.
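
To make that concrete, here is a minimal sketch. It assumes Python 3.1 or
later, where the "surrogateescape" error handler (PEP 383) implements a
closely related scheme; that handler is not something this thread had
available, and the file name is made up for illustration.

```python
# Sketch only: uses the "surrogateescape" error handler from Python 3.1+
# (PEP 383), which follows the same idea as UTF-8b.
raw = b"caf\xff.txt"  # hypothetical name; 0xFF never occurs in valid UTF-8

# Decoding maps the undecodable byte 0xFF to the lone surrogate U+DCFF.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = name.encode("utf-8", "surrogateescape")

# A strict encode refuses the lone surrogate; this is the "surprise"
# when such a name escapes to code that expects well-formed text.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The round trip is lossless, but the decoded string is not valid Unicode
text, so anything that encodes it strictly will raise.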

> I noticed that OSX has a different approach yet. I believe it insists
> on valid UTF-8 filenames. It may even require some normalization but I
> don't know if the kernel enforces this. I tried to create a file named
> b'\xff' and it came out as %ff. Then "rm %ff" worked. So I think it
> may be replacing all bad UTF-8 sequences with their % encoding.

I suspect Linux will eventually take this route as well.  If ext3 had
an option for UTF-8 validation I know I'd want it on.  That'd move the
error to the program creating bogus file names, rather than those
trying to read, display, and manage them.

> The "set filesystem encoding to be Latin-1" approach has a certain
> charm as well, but clearly would be a mistake on OSX, and probably on
> other systems too (whenever the user doesn't think in Latin-1).

Aye, it's a better hack than UTF-8b, but adding byte functions is even better.
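
As a rough illustration of what "either bytes or string, but not both"
could look like for join, here is a minimal POSIX-only sketch. The name
join_same_type is hypothetical and this is not how os.path.join is
written; it only shows that the type check itself can stay small.

```python
def join_same_type(first, *rest):
    """Join POSIX-style path components that are all str or all bytes.

    Hypothetical helper for illustration; not the os.path.join code.
    """
    kind = type(first)
    if kind not in (str, bytes):
        raise TypeError("path components must be str or bytes")
    # Pick the separator that matches the type of the first component.
    sep = "/" if kind is str else b"/"
    path = first
    for part in rest:
        if type(part) is not kind:
            raise TypeError("cannot mix str and bytes path components")
        if part.startswith(sep):
            path = part            # an absolute component resets the path
        elif not path or path.endswith(sep):
            path += part           # no extra separator needed
        else:
            path += sep + part
    return path
```

With this, join_same_type("a", "b") gives "a/b" and
join_same_type(b"a", b"b") gives b"a/b", while mixing the two types
raises TypeError instead of guessing an encoding.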

-- 
Adam Olsen, aka Rhamphoryncus