unicode filenames

Andrew Dalke dalke at dalkescientific.com
Thu Feb 6 15:24:00 EST 2003


Beni Cherniavsky <cben at techunix.technion.ac.il>:
> unix must stay with the
> byte-oriented filenames at the low level.  This ensures that all programs
> that store file names in files, etc., continue to work.

> In the transition period, many people still use other encodings,
> sometimes different on different mounts.  Since filenames are
> frequently stored in files, programs will break if the filename
> encoding is different on different mountpoints.  If you suggest
> supporting that in programs, you effectively require that all utilities
> like ls, find, xargs, etc. learn to convert filenames.

I don't know if you saw the example code I posted earlier.  My
suggestions were only meant for Python, hence the comments about
ls, find, xargs, etc. are of no concern.

I was also only concerned with low-level functions which deal with
the filesystem.  Right now for unix these take byte strings, not
unicode strings.  The behaviour I requested would only be triggered if
a unicode string was passed in, as in 'os.listdir(u".")', or if
os.getcwdu() was called.  Hence it should break no existing applications
because they don't pass in unicode filenames, except those which are
identically represented in 7-bit ASCII.
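
A rough sketch of that type-driven behaviour, written for current Python
where str plays the part of the unicode string and bytes the byte string.
The name listdir_typed and its parameters are made up for illustration;
nothing like it is claimed to exist in the library.

```python
import os
import tempfile

def listdir_typed(path, encoding="utf-8", errors="strict"):
    """Hypothetical helper: return names in the same string type as `path`."""
    if isinstance(path, bytes):
        # Byte string in, raw byte strings out: existing callers see no change.
        return os.listdir(path)
    # Text string in: list the raw names, then decode each one.
    raw_names = os.listdir(path.encode(encoding))
    return [name.decode(encoding, errors) for name in raw_names]

# Small demonstration in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    open(os.path.join(scratch, "notes.txt"), "w").close()
    text_names = listdir_typed(scratch)                    # list of str
    byte_names = listdir_typed(scratch.encode("utf-8"))    # list of bytes
```

Only the text-string branch does any decoding, which is why code that
keeps passing byte strings should be unaffected.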

> So *please*, expect the user to configure all mounts to use the same
> encoding, the one he is using in his locale.  It's not hard.
> Otherwise he will not be able to work with other programs anyway...
> And that encoding best be UTF-8, of course.

In that I differed.  In my naive view, I had a registration system
for directory locations, so different mount points could have different
encodings.  Eg, I don't know if NFS mounts support unicode recoding.

>>    1) there is a default filesystem encoding, which is initialized
>>        to None if os.path.supports_unicode_file is True, otherwise
>>        it's set to sys.getdefaultencoding()
>>
> Yep.  For corner cases, it should be settable.  And I don't like the
> name, it should say "filename" instead of "file" (that prompts for
> shortening some other part, like "supports").

I have no ability to change that name -- it's already in Python 2.3.

I don't like 'filename' as it can also refer to a directory name.  OTOH,
few should care about this name so I don't think it's that important.

> One important point: files with names illegal in this encoding must
> not become inaccessible.  Instead of raising exceptions, Python's
> library should just fall back and return the byte string.

So have mixed unicode and byte strings returned from a function?  Yech.
Python's unicode functions have an error mode which can describe how
to handle this case.  I would rather the conversion functions allow
passing in that parameter as well.
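
To make the error modes concrete, here is how the standard decode call
treats a name that is not valid in the chosen encoding.  The byte string
is an invented example (Latin-1 bytes read as UTF-8), shown in current
Python syntax.

```python
raw_name = b"caf\xe9.txt"   # Latin-1 bytes; \xe9 is not valid UTF-8 here

# "strict" refuses the name outright.
try:
    raw_name.decode("utf-8", "strict")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "error"

# "replace" substitutes U+FFFD for the bad byte; "ignore" drops it.
replaced = raw_name.decode("utf-8", "replace")
ignored = raw_name.decode("utf-8", "ignore")
```

Neither "replace" nor "ignore" round-trips back to the original bytes, so
a name decoded that way is fine for display but not for reopening the file.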

If you want access to the raw filesystem,

   os.listdir(unicode_string.encode("utf-8"))

should still return a list of raw byte strings.  My proposal is to
make it easy to handle unicode filenames for those who don't expect
to handle all the corner cases, and let fans of the details still
have access to those details, with a bit more work.

>>    2) there is a registration system which is used to define encodings
>>        used for different mount locations.  If a filename/dirname is
>>        not covered, use the default filesystem encoding
>>
> No way!  See above.  Instead of fixing a couple of places (fstab,
> nfs&samba conf) you are trying to fix this in every single application
> running in the system.

No, I am not.  This is a per-Python registry only.  It would only be used
by a very few people, such as those building apps which try to handle the
filesystem as well as they can.

I am not happy with it because I don't see a good way for it to
work.  I include it because it's the most general solution I could
consider, and I didn't want it to be ignored by accident.
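
Just to pin down what such a registry could mean, a minimal sketch,
assuming a longest-prefix match on directory names.  Every name here
(register_encoding, encoding_for, DEFAULT_FS_ENCODING, the example mount
points) is hypothetical, and it does nothing about the hard parts such as
symlinks or relative paths.

```python
import os

_encoding_registry = {}          # normalized directory prefix -> encoding name
DEFAULT_FS_ENCODING = "utf-8"    # assumed fallback for uncovered paths

def register_encoding(directory, encoding):
    """Declare that names under `directory` use `encoding`."""
    _encoding_registry[os.path.normpath(directory)] = encoding

def encoding_for(path):
    """Return the encoding of the longest registered prefix covering `path`."""
    path = os.path.normpath(path)
    best, best_len = DEFAULT_FS_ENCODING, -1
    for prefix, encoding in _encoding_registry.items():
        # Match whole path components only, so /mnt/legacy does not
        # accidentally cover /mnt/legacyother.
        covered = path == prefix or path.startswith(prefix.rstrip(os.sep) + os.sep)
        if covered and len(prefix) > best_len:
            best, best_len = encoding, len(prefix)
    return best

register_encoding("/mnt/legacy", "latin-1")
register_encoding("/mnt/legacy/jp", "shift_jis")
```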

>> If this makes sense, should it be added to Python's core?
>>
> +1.

Still waiting for Martin "Herr Unicode" von Löwis to comment.  I
hear he's on vacation ...

    	    	    	    	    	Andrew
    	    	    	    	    	dalke at dalkescientific.com



