unicode filenames

Thu Feb 6 06:16:25 EST 2003

On 2003-02-03, Andrew Dalke wrote:

> Okay, so it seems like no one knows how to handle unicode filenames
> under Unix.

Since unix can't afford to change all APIs and programs like windows did
(the mess that resulted explains why <wink>), unix must stay with
byte-oriented filenames at the low level.  This ensures that all programs
that store file names in files, etc., continue to work.  UTF-8 is the only
encoding that can represent all of unicode and still satisfy these needs,
so everybody should migrate to UTF-8 filenames (CJK users might have
reservations about this; I'd be happy to learn their opinion).
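
To make the "byte-oriented" point concrete, here is a small Python 3
sketch (the sample names and the helper utf8_is_path_safe are made up for
illustration).  It checks that UTF-8 encoded names never introduce a NUL
or "/" byte, which is what lets existing byte-based path code keep
working, and contrasts that with UTF-16.

```python
# Sketch: every byte of a multi-byte UTF-8 sequence is >= 0x80, so a
# non-ASCII character can never produce a stray NUL or "/" byte.
names = ["naïve.txt", "שלום", "日本語.txt"]  # hypothetical sample names

def utf8_is_path_safe(name):
    """Return True if the UTF-8 form has no NUL or '/' bytes."""
    raw = name.encode("utf-8")
    return b"\x00" not in raw and b"/" not in raw

results = [utf8_is_path_safe(n) for n in names]

# By contrast, UTF-16 embeds NUL bytes even for plain ASCII text,
# so it cannot pass through NUL-terminated C string APIs.
utf16_has_nul = b"\x00" in "a".encode("utf-16-le")
```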

In the transition period, many people still use other encodings,
sometimes different on different mounts.  Since filenames are
frequently stored in files, programs will break if the filename
encoding is different on different mountpoints.  If you suggest
supporting that in programs, you effectively require that all utilities
like ls, find, xargs, etc. learn to convert filenames.  For if they
don't, things will break: e.g. find will produce output in a mix of
encodings, which can't be fixed.  But that's too much work!  The only
chance to do that is in glibc - but it will subtly upset a lot of
programs in any case.  This also implies that the filename encoding
must be the same as the standard I/O encoding (that's why there is
no LC_FILENAME_CTYPE).
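
The "mix of encodings" failure can be sketched in Python 3 as follows.
The two mounts and the file name are hypothetical, and decodes_cleanly is
a helper invented for this example; the point is only that once a
byte-transparent tool has concatenated names from both mounts, no single
encoding decodes the whole listing correctly.

```python
# Hypothetical setup: the same name stored on two mounts, one using
# Latin-1 filenames and the other using UTF-8 filenames.
name = "café.txt"
latin1_mount = name.encode("latin-1")
utf8_mount = name.encode("utf-8")

# What a byte-transparent tool such as find would emit, one per line.
listing = latin1_mount + b"\n" + utf8_mount + b"\n"

def decodes_cleanly(data, encoding):
    """Return True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Reading the listing as UTF-8 fails on the Latin-1 name.
as_utf8_ok = decodes_cleanly(listing, "utf-8")

# Reading it as Latin-1 never fails, but garbles the UTF-8 name.
as_latin1 = listing.decode("latin-1").splitlines()
```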

So *please*, expect the user to configure all mounts to use the same
encoding, the one he is using in his locale.  It's not hard.
Otherwise he will not be able to work with other programs anyway...
And that encoding best be UTF-8, of course.

> Perhaps the following is the proper behaviour?
>
>    1) there is a default filesystem encoding, which is initialized
>        to None if os.path.supports_unicode_file is True, otherwise
>        it's set to sys.getdefaultencoding()
>
Yep.  For corner cases, it should be settable.  And I don't like the
name, it should say "filename" instead of "file" (that prompts for
shortening some other part, like "supports").

One important point: files with names illegal in this encoding must
not become inaccessible.  Instead of raising exceptions, Python's
library should just fall back and return the byte string.
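
A minimal Python 3 sketch of that fallback, assuming names arrive from
the OS as byte strings.  The functions decode_filename and list_names are
hypothetical, not an existing library API; they only illustrate "decode
if possible, otherwise hand back the raw bytes" so that no file becomes
unreachable.

```python
def decode_filename(raw, encoding="utf-8"):
    """Decode a raw filename, or return the bytes unchanged if it is
    not valid in the configured filename encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Do not raise: the caller can still open the file by its bytes.
        return raw

def list_names(raw_names, encoding="utf-8"):
    """Apply the fallback to each name in a directory listing."""
    return [decode_filename(r, encoding) for r in raw_names]

# The third name is Latin-1 and therefore illegal as UTF-8.
raw_names = [b"ok.txt", "שלום".encode("utf-8"), b"caf\xe9.txt"]
names = list_names(raw_names)
```

The result is a mixed list of text and byte strings, so callers have to
be prepared for both types; that is the price of keeping every file
accessible.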

>    2) there is a registration system which is used to define encodings
>        used for different mount locations.  If a filename/dirname is
>        not covered, use the default filesystem encoding
>
No way!  See above.  Instead of fixing a couple of places (fstab,
nfs&samba conf) you are trying to fix this in every single application
running in the system.

>    3) a) when the input dirname or filename is a string, use the
>         current behaviour
>       b) when unicode, use the encoding from 2 (may have to get
>         the absolute path name  ... don't like this part of it.
>         Perhaps the call to #2 should only be done for full paths?)
>
No #2, no problem <wink>.

> If this makes sense, should it be added to Python's core?
>
+1.

-- 
Beni Cherniavsky <cben at tx.technion.ac.il>

Do not feed the Bugzillas.