[Python-Dev] Unicode strings as filenames

Martin v. Loewis martin@v.loewis.de
Sun, 6 Jan 2002 20:44:45 +0100


> That's the global, sure but the code using it is scattered
> across fileobject.c and the posix module. I think it would be
> a good idea to put all this file naming code into some
> Python/fileapi.c file which then also provides C APIs for
> extensions to use. These APIs should then take the file name
> as PyObject* rather than char* to enable them to handle
> Unicode directly.

What do you gain by that? Most of the posixmodule functions that take
filenames are direct wrappers around the system call. Using another
level of indirection is only useful if the fileapi.c functions are
used in different places. Notice that each function (open, access,
stat, etc) is used exactly *once* currently, so putting this all into
a single place just makes the code more complex.
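For illustration, the one-to-one mapping is visible from the Python side as well (the calls below use today's os module names, which are assumptions relative to the 2002 sources under discussion):

```python
import os

# Each call below corresponds to exactly one posixmodule wrapper,
# which in turn forwards the (char*) filename to exactly one system
# call -- there is no shared filename-handling layer to factor out.
path = "."
assert os.access(path, os.F_OK)   # wraps access(2)
st = os.stat(path)                # wraps stat(2)
fd = os.open(path, os.O_RDONLY)   # wraps open(2)
os.close(fd)
```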

The extension-module argument is a red herring: I don't think there
are many extension modules out there which want to call access(2) but
would like to do so using a PyObject* as the first argument, but
numbers as the other arguments.

> > Of course, if the system has an open function that expects wchar_t*,
> > we might want to use that instead of going through a codec. Off hand,
> > Win32 seems to be the only system where this might work, and even
> > there, it won't work on Win95.
> 
> I expect this to become a standard in the next few years.

I doubt that. Posix people (including developers of various posixish
systems) have frequently rejected that idea in recent years. Even for
the most recent system in this respect (OS X), we hear that they still
open files with a char*, where a char is a byte; the only advance is
that those bytes are guaranteed to be UTF-8.

It turns out that this is all you need: with that guarantee, there is
no need for an additional set of APIs. UTF-8 was originally invented
precisely to represent file names (it was called FSS-UTF, the File
System Safe UTF, at the time); it is likely that more systems will
follow this convention. If so, a global per-system file system
encoding is all that's needed.
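A minimal sketch of that idea, assuming the UTF-8 guarantee holds (the helper name fs_bytes is hypothetical):

```python
import sys

# Hypothetical helper: with one global per-system file system
# encoding, converting a Unicode name for a char* API is a single
# encode call -- no parallel wchar_t* API set is needed.
def fs_bytes(name, encoding="utf-8"):
    return name.encode(encoding)

assert fs_bytes("caf\u00e9") == b"caf\xc3\xa9"
# Modern CPython exposes the per-system choice as:
print(sys.getfilesystemencoding())
```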

The only problem is that on Windows, MS has already decided that the
byte APIs use CP_ACP (the ANSI code page), so they cannot change it to
UTF-8 now; that's why Windows will need special casing if people are
unhappy with the "mbcs" approach (which some apparently are).
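The Windows special case could then be confined to the choice of encoding. A sketch (fs_encode is a hypothetical name; "mbcs" is the Python codec for the ANSI code page):

```python
import sys

def fs_encode(name):
    # Hypothetical sketch of the special-casing: Windows byte APIs
    # interpret names in the ANSI code page ("mbcs" in Python);
    # every other platform is assumed here to guarantee UTF-8.
    if sys.platform == "win32":
        return name.encode("mbcs")
    return name.encode("utf-8")

assert fs_encode("abc") == b"abc"   # ASCII is safe under both
```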

> > Also, it is more difficult than threads: for threads, there is a fixed
> > set of API features that need to be represented. Doing Py_UNICODE*
> > opening alone is easy, but look at the number of posixmodule functions
> > that all expect file names of some sort.
> 
> Doesn't that support the idea of having a small subsystem
> in Python which exposes the Unicode aware APIs to Python
> and its extensions ?

No. It is a lot of work, and an additional layer of indirection, with
no apparent advantage. Feel free to write a PEP, though.

Regards,
Martin