[Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

Mon Sep 29 19:06:01 CEST 2008

> Victor Stinner wrote:

(Thanks Victor for moving this to the list. Having a discussion in the
tracker is really painful, I find.)

>> POSIX OS
>> --------
>>
>> The default behaviour should be to use unicode and raise an error if
>> conversion to unicode fails. It should also be possible to use bytes using
>> bytes arguments and optional arguments (for getcwd).
>>
>>  - listdir(unicode) -> unicode and raise an error on invalid filename

I know I keep flip-flopping on this one, but the more I think about it
the more I believe it is better to drop those names than to raise an
exception. Otherwise a "naive" program that happens to use
os.listdir() can be rendered completely useless by a single non-UTF-8
filename. Consider the use of os.listdir() by the glob module. If I am
globbing for *.py, why should the presence of a file named b'\xff'
cause it to fail?

Robust programs using os.listdir() should use the bytes->bytes version.
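
A minimal sketch of the "drop those names" behaviour, built on top of
the bytes->bytes listdir. The helper names decodable_names and
listdir_text and the default "utf-8" encoding are assumptions made for
illustration, not a proposed API.

```python
import os

def decodable_names(raw_names, encoding="utf-8"):
    """Decode each bytes name, silently dropping those that fail to decode."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            continue  # drop the bad name instead of raising
    return names

def listdir_text(path, encoding="utf-8"):
    """Hypothetical listdir(unicode) -> unicode that skips undecodable names."""
    raw_names = os.listdir(os.fsencode(path))  # bytes in, bytes out
    return decodable_names(raw_names, encoding)
```

With this, a directory holding a.py, b.py and a file named b'\xff'
yields only the two decodable names, so a glob for *.py keeps working.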

>>  - listdir(bytes) -> bytes
>>  - getcwd() -> unicode
>>  - getcwd(bytes=True) -> bytes
>>  - open(): accept bytes or unicode
>>
>> os.path.*() should accept operations on bytes filenames, but maybe not on
>> bytes+unicode arguments. os.path.join('directory', b'filename'): raise an
>> error (or use *implicit* conversion to bytes)?

(Yeah, it should be all bytes or all strings.)
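
As a sketch of the "all bytes or all strings" rule, here is a
two-argument POSIX-style join that raises on mixed arguments. The name
join2 is hypothetical, and this is one possible shape rather than how
os.path.join is or should be written.

```python
def join2(a, b):
    """Join two POSIX path components that are both str or both bytes."""
    if isinstance(a, str) and isinstance(b, str):
        sep = "/"
    elif isinstance(a, bytes) and isinstance(b, bytes):
        sep = b"/"
    else:
        # No implicit conversion: mixing the two types is an error.
        raise TypeError("cannot mix bytes and str path components")
    if b.startswith(sep):
        return b  # an absolute second component replaces the first
    if not a or a.endswith(sep):
        return a + b
    return a + sep + b
```

Here join2('directory', 'filename') and join2(b'directory', b'filename')
both work, while join2('directory', b'filename') raises TypeError.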

On Mon, Sep 29, 2008 at 9:45 AM, Georg Brandl <g.brandl at gmx.net> wrote:

> This approach (changing all path-handling functions to accept either bytes
> or string, but not both) is doomed in my eyes. First, there are lots of them,
> second, they are not only in os.path but in many modules and also in user
> code, and third, I see no clean way of implementing them in the specified way.
> (Just try to do it with os.path.join as an example; I couldn't find the
> good way to write it, only the bad and the ugly...)

It doesn't have to be supported for all operations -- just enough to
be able to access all the system calls, and do the most basic pathname
manipulations (split and join -- almost everything else can be built
out of those).
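
To illustrate "built out of those": a sketch that derives dirname and
basename from split alone. It uses posixpath.split, which in current
Python accepts either str or bytes; the two wrapper functions are
written here only for illustration.

```python
import posixpath

def dirname(p):
    """Directory part of p, taken from split(); works for str or bytes."""
    return posixpath.split(p)[0]

def basename(p):
    """Final component of p, taken from split(); works for str or bytes."""
    return posixpath.split(p)[1]
```

The result type follows the argument type, so bytes names never pass
through a decoding step.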

> If I had to choose, I'd still argue for the modified UTF-8 as filesystem
> encoding (if it were UTF-8 otherwise), despite possible surprises when a
> such-encoded filename escapes from Python.

I'm having a hard time finding info about UTF-8b. Does anyone have a
decent link?

I noticed that OSX takes yet another approach. I believe it insists
on valid UTF-8 filenames. It may even require some normalization but I
don't know if the kernel enforces this. I tried to create a file named
b'\xff' and it came out as %ff. Then "rm %ff" worked. So I think it
may be replacing all bad UTF-8 sequences with their % encoding.
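
To make that guess concrete, here is a sketch of "replace each bad
UTF-8 byte with its % encoding" done in Python. It only mimics the
behaviour guessed at above and is not a claim about what OSX actually
does. The error-handler name "percentescape" and the helper
percent_decode_name are made up for illustration.

```python
import codecs

def _percent_escape(err):
    """Decode error handler: turn each undecodable byte into %xx."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position to resume decoding at.
    return "".join("%%%02x" % byte for byte in bad), err.end

codecs.register_error("percentescape", _percent_escape)

def percent_decode_name(raw):
    """Decode a bytes filename as UTF-8, %-escaping any invalid bytes."""
    return raw.decode("utf-8", errors="percentescape")
```

Under this scheme a valid UTF-8 name decodes unchanged, and a name
containing stray bytes still comes out as printable text.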

The "set filesystem encoding to be Latin-1" approach has a certain
charm as well, but clearly would be a mistake on OSX, and probably on
other systems too (whenever the user doesn't think in Latin-1).
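
The trade-off can be seen in a few lines. Latin-1 maps every byte to a
code point, so decoding never fails and round-trips exactly, but a name
that was really UTF-8 is shown as the wrong characters. The sample
filename is made up for illustration.

```python
# Every possible byte value decodes under Latin-1 and round-trips exactly.
all_bytes = bytes(range(256))
round_trip_ok = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# A UTF-8 encoded name, read as Latin-1, shows two characters for one.
raw = "\u00e9.py".encode("utf-8")   # e-acute followed by ".py"
shown = raw.decode("latin-1")       # U+00C3 U+00A9 followed by ".py"
```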

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)