[Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

Tue Sep 30 23:22:11 CEST 2008

On Tue, Sep 30, 2008 at 1:04 PM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Guido van Rossum wrote:
>> On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>>>> Changing the default file system encoding to store bytes in Unicode is like
>>>> introducing a new Python type: <fake Unicode for filename hacks>.
>>> Exactly. Seems like the best solution to me, despite your polemics.
>>
>> Martin, I don't understand why you are in favor of storing raw bytes
>> encoded as Latin-1 in Unicode string objects, which clearly gives rise
>> to mojibake. In the past you have always been staunchly opposed to API
>> changes or practices that could lead to mojibake (and you had me quite
>> convinced).
>
> True. I try to weigh the need for simplicity in the API against the
> need to support all cases. So I see two solutions:
>
> a) support bytes as file names. Supports all cases, but complicates
>   the API very much, by pervasively bringing bytes into the status
>   of a character data type. IMO, this must be prevented at all costs.

That's a matter of opinion. I would also like to point out that it is
in fact already supported by the system calls. io.open() doesn't, but
that's a wrapper around _fileio._FileIO which does support bytes. All
other syscalls already do the right thing (even readlink()!) except
os.listdir(), which returns a mixture of bytes and str values (which
is horrible) and os.getcwd() which needs a bytes equivalent. Victor's
patch addresses all these issues.
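
A minimal sketch of the "bytes in, bytes out" behaviour described above,
assuming a current Python 3 interpreter rather than the pre-release under
discussion. list_names is a hypothetical helper; os.fsencode() and
os.getcwdb() are the names today's os module uses, not something this
message specifies.

```python
import os
import tempfile


def list_names(directory):
    """Return sorted directory entries; the result type follows the argument type."""
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp:
    # Create one file so the listing is not empty.
    open(os.path.join(tmp, "example.txt"), "w").close()

    str_names = list_names(tmp)                 # str path -> list of str
    bytes_names = list_names(os.fsencode(tmp))  # bytes path -> list of bytes

# Bytes counterpart of os.getcwd().
cwd_bytes = os.getcwdb()
```

The point of the sketch is that the caller picks the type once, by the
type of the argument, and never sees a mixed list.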

Victor's patch also tries to fix glob.py, fnmatch.py, and
posixpath.py. That is more debatable, because this might be the start
of a never-ending project. OTOH we have precedents, e.g. the re module
similarly supports both bytes and unicode (and makes an effort to
avoid mixing them).
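
As a small illustration of the re precedent (a sketch, with the file name
chosen arbitrarily): the same pattern works on str and on bytes, but a
str pattern applied to bytes is rejected instead of being silently
converted.

```python
import re

# str pattern against str input.
text_match = re.match(r"(\w+)\.txt", "notes.txt")

# bytes pattern against bytes input.
bytes_match = re.match(rb"(\w+)\.txt", b"notes.txt")

# Mixing the two types raises TypeError.
try:
    re.match(r"(\w+)\.txt", b"notes.txt")
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```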

> b) make character (Unicode) strings the only string type. Does not
>   immediately support all cases, so some hacks are needed. However,
>   even with the hacks, it preserves the simplicity of the API; the
>   hacks then should ideally be limited to the applications that need
>   it. On this side, I see the following approaches:
>   1. try to automatically embed non-representable characters into
>      the Unicode strings, e.g. by using PUA characters. Reduces
>      the amount of mojibake, but produces a lot of difficult issues.
>   2. let applications that desire so access all file names in a
>      uniform manner, at the cost of producing tons of mojibake
>
> In this case, I think mojibake is unavoidable: it is just a plain
> flaw in the POSIX implementations (not the API or specification) that
> you can run into file names where you can't come up with the right
> rendering. Even for solution a), the resulting data cannot
> be displayed "correctly" in all cases.

But I still like the ultimate solution to displaying names for (a)
better: if it's not decodable, display it as the repr() of a bytes
object. (Which happens to be its str() as well.)
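
A sketch of that display rule, assuming file names arrive as bytes.
display_name and its encoding parameter are hypothetical names used only
for illustration.

```python
import sys


def display_name(raw_name, encoding=None):
    """Return a printable form of a bytes file name.

    Decodable names are shown as text; anything else falls back to the
    repr() of the bytes object, so nothing is guessed or mangled.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    try:
        return raw_name.decode(encoding)
    except UnicodeDecodeError:
        return repr(raw_name)


readable = display_name(b"caf\xc3\xa9", "utf-8")  # valid UTF-8
fallback = display_name(b"caf\xe9", "utf-8")      # Latin-1 bytes, not valid UTF-8
```

The fallback is ugly, but it is unambiguous and can be pasted back into
code to reach the file.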

> Currently, I favor b2, but haven't given up on b1, and they don't
> exclude each other. b2 is simple to implement, and delegates the
> choice between legible file names and universal access to all files
> to the application. Given the way Unix works, this is the most sensible
> choice, IMO: by default, Python should try to make file names legible,
> but stuff like backup applications should be implementable also -
> and they don't need legible file names.

I don't believe that an application-wide choice is safe. For example
the tempfile module manipulates filenames (at least for
NamedTemporaryFile) and I think it would be wrong if it were affected
by such a global setting. (E.g. the user could pass a suffix argument
containing Unicode characters outside Latin-1.)
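
To make the concern concrete, here is a sketch that assumes the global
setting amounts to decoding raw file name bytes as Latin-1. It does not
call tempfile; the name and suffix values are made up for the example.

```python
# A UTF-8 encoded file name as it would come back from the OS.
raw = "caf\u00e9".encode("utf-8")        # b'caf\xc3\xa9'

# Latin-1 decoding never fails and round-trips the bytes exactly...
smuggled = raw.decode("latin-1")
round_trips = smuggled.encode("latin-1") == raw

# ...but the text it produces is mojibake.
looks_wrong = smuggled != "caf\u00e9"

# A caller-supplied suffix outside Latin-1 cannot be mapped back to bytes.
suffix = "-\u20ac"                       # EURO SIGN is not in Latin-1
try:
    (smuggled + suffix).encode("latin-1")
    suffix_ok = True
except UnicodeEncodeError:
    suffix_ok = False
```

So library code that combines such a name with ordinary Unicode text
either fails or has to know which convention the string follows.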

> I think option a) will haunt us forever. People will ask for more and
> more features in the bytes type, eventually asking "give us Python
> 2.x strings back". It already starts: see #3982, where Benjamin
> asks to have .format added to bytes (for a reason unrelated to file
> names).

I'm not so worried about feature requests for the bytes type unrelated
to filesystems; we can either grant them or not, and I am actually in
many cases in favor of granting them -- just like we support bytes in
the re module as I already mentioned above. The bytes and str types
have intentionally similar APIs, because they have similar structure,
and even somewhat similar semantics (b'ABC' and 'ABC' have related
meanings even if there are subtle differences).
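
A few lines showing both the parallel and the subtle differences, run
against a current Python 3 interpreter:

```python
text = "ABC"
data = b"ABC"

# Many methods exist on both types and behave alike.
same_lower = (text.lower(), data.lower())
same_prefix = (text.startswith("AB"), data.startswith(b"AB"))

# Indexing differs: str yields a 1-character str, bytes yields an int.
first_items = (text[0], data[0])

# The two never compare equal; conversion has to be explicit.
equal_without_decoding = (text == data)
equal_after_decoding = (text == data.decode("ascii"))
```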

I am also encouraged by Glyph's support for (a). He has a lot of
practical experience.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)