unicode filenames

Andrew Dalke adalke at mindspring.com
Sun Feb 2 19:14:43 EST 2003


Okay, I'm confused.  I've been working my way through the changes
put into Python 2.3.  One of these is PEP 277, "Unicode file name
support for Windows NT" at http://www.python.org/peps/pep-0277.html .

I decided to experiment with how to use unicode filenames.  I
thought I understood, until I tried it out.

How do I deal with possibly unicode filenames in a platform
independent manner?

I normally use unix.  What's the right way to treat filenames
under that OS?  As Latin-1?  Or UTF-8?  As far as I can tell,
filenames are simply bytes, so I can make whatever interpretation
I want on the characters, and the standard viewpoint is to
interpret those characters as Latin-1.

[dalke at zebulon src]$ ./python
Python 2.3a1 (#7, Feb  2 2003, 15:54:30)
[GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> open("spårvägen", "w").close()
 >>> ^D
[dalke at zebulon src]$ ls -l sp*
-rw-r--r--    1 dalke    users           0 Feb  2 16:19 spårvägen
[dalke at zebulon src]$ ls sp* | od -c
0000000   s   p 345   r   v 344   g   e   n  \n
0000012
[dalke at zebulon src]$ ./python
Python 2.3a1 (#7, Feb  2 2003, 15:54:30)
[GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> s = unicode("spårvägen", "latin-1")
 >>> s
u'sp\xe5rv\xe4gen'
 >>> open(s, "w").close()
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in
position 2: ordinal not in range(128)
 >>> s.encode("utf8")
'sp\xc3\xa5rv\xc3\xa4gen'
 >>> open(s.encode("utf8"), "w").close()
 >>> ^D
[dalke at zebulon src]$ ls -l sp*
-rw-r--r--    1 dalke    users           0 Feb  2 16:19 spårvägen
-rw-r--r--    1 dalke    users           0 Feb  2 16:22 spÃ¥rvÃ¤gen
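
Listing them back, both names come out as plain byte strings, and
nothing records which encoding either one was written in.  Roughly:

 >>> import os
 >>> [name for name in os.listdir(".") if name.startswith("sp")]
 ['sp\xe5rv\xe4gen', 'sp\xc3\xa5rv\xc3\xa4gen']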


Does that mean Unix filenames can't contain non-Latin-1 characters?
Or does it mean I need to get the info on how to interpret the
filename using something from the current environment?
(sys.getdefaultencoding() doesn't work since that reports 'ascii'
for me.)  Could different directories be encoded differently, eg
    /home/<encoded in ASCII>/<encoded in Latin-1>/<encoded in big5> ?
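
As far as I can tell nothing prevents it, since each component is
only bytes.  Here's a quick sketch that would create exactly such a
mixed path (the directory names are just for illustration):

   import os
   name = u'sp\xe5rv\xe4gen'
   os.makedirs(os.path.join("plain_ascii",
                            name.encode("latin-1"),
                            name.encode("utf-8")))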

And what happens when a remote file is mounted, say, from a MS
Windows OS?  Are they represented as UTF-8?  Something else?
Is that standardized or is it a property of the mount mechanism
and can change accordingly?



Okay, now let's see what changed in Python 2.3.  According to
Andrew Kuchling's "What's new in Python 2.3" at
   http://www.python.org/doc/2.3a1/whatsnew/node5.html

      On Windows NT, 2000, and XP, the system stores file names
      as Unicode strings. Traditionally, Python has represented
      file names as byte strings, which is inadequate because it
      renders some file names inaccessible.

      Python now allows using arbitrary Unicode strings (within
      the limitations of the file system) for all functions that
      expect file names, most notably the open() built-in function.
      If a Unicode string is passed to os.listdir(), Python now
      returns a list of Unicode strings. A new function,
      os.getcwdu(), returns the current directory as a Unicode string.
         ...
      Other systems also allow Unicode strings as file names but
      convert them to byte strings before passing them to the system,
      which can cause a UnicodeError to be raised. Applications can
      test whether arbitrary Unicode strings are supported as file
      names by checking os.path.unicode_file_names, a Boolean value.
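
I read that as meaning code along these lines should just work on
Windows NT/2000/XP (untested by me; this is only my reading of the
quoted text):

   import os

   f = open(u'sp\xe5rv\xe4gen', "w")   # unicode name passed straight through
   f.close()
   names = os.listdir(u".")            # unicode arg -> list of unicode names
   cwd = os.getcwdu()                  # current directory as a unicode string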

Indeed, on my Linux system, "os.path.supports_unicode_filenames" (I've
sent in a bug report on the difference in attribute names) is False.

Still, 'os.getcwdu()' does exist and works for my Linux system.  (I
removed the 'spårvägen' files and restarted Python.)

 >>> import os
 >>> os.path.supports_unicode_filenames
False
 >>> os.getcwdu()
u'/home/dalke/cvses/python/dist/src'
 >>> os.mkdir("spårvägen")
 >>> os.chdir("spårvägen")
 >>> os.getcwdu()
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 36: 
ordinal not in range(128)
 >>>

This seems to imply that if I want to know the current working
directory (eg, to display in a GUI widget, which understands how to
display unicode strings) my code needs to work like this:

   import sys

   if os.path.supports_unicode_filenames:
     cwd = os.getcwdu()
   else:
     # maybe this is what sys.getfilesystemencoding() is for?  It can
     # apparently return None, hence the latin-1 fallback.
     encoding = sys.getfilesystemencoding() or 'latin-1'
     cwd = unicode(os.getcwd(), encoding)

Ugly .. quite ugly.  And the proper way to handle this isn't
documented anywhere I can find.

Next I want to display the files in that directory.  For MS
Windows it looks like I can do that with the unicode string,
as in

   os.listdir(cwd)

but with unix I need to do

   [unicode(x, encoding) for x in os.listdir(cwd.encode(encoding))]

so the portable code to list the files in a directory is something
like this

   def my_listdir(dirname, filesystem_encoding = "ascii"):
     if os.path.supports_unicode_filenames:
       return os.listdir(dirname)
     enc = filesystem_encoding
     return [unicode(x, enc) for x in os.listdir(dirname.encode(enc))]

Again, that seems rather ugly, since I need to roll my own code
to get what I believe to be platform independence.

Similar problems hold true for mkdir and other functions.  Eg,
I get a unicode string from the user naming a directory to
create.  To work on both MS Windows and non-MS Windows machines,
I need to do something like

    def my_mkdir(dirname, filesystem_encoding = "ascii"):
      if os.path.supports_unicode_filenames:
        os.mkdir(dirname)
      else:
        os.mkdir(dirname.encode(filesystem_encoding))

(possibly with some error catching for the case where the new name
is, say, Thai but the filesystem encoding is Latin-1.)

In other words, it seems I need to write a wrapper for every
function which might take a unicode string, so that when
supports_unicode_filenames is False the string gets converted to
the appropriate default filesystem encoding first.
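
Something like this sketch, perhaps (the helper name _fs_wrap is
made up, and "ascii" as the default fallback is only a guess):

    def _fs_wrap(func, filesystem_encoding = "ascii"):
      # encode a unicode path argument before calling the real os
      # function, on platforms without unicode filename support
      if os.path.supports_unicode_filenames:
        return func
      def wrapped(path, *args):
        if isinstance(path, unicode):
          path = path.encode(filesystem_encoding)
        return func(path, *args)
      return wrapped

    my_mkdir = _fs_wrap(os.mkdir)
    my_rmdir = _fs_wrap(os.rmdir)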

(Again, if different directory components can be in different
character sets then this doesn't work.  But I don't think anyone
can reasonably expect that.)

It seems, in my naive view of unicode, that there should be a
system-wide function to get/set the default filesystem encoding,
and the Python functions to mkdir, listdir, rmdir, etc. should
use that encoding when a Unicode string is passed in to them,
and that the default encoding be ASCII as it is now.
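
That is, something like this purely hypothetical API (the setter
name is invented; this is only meant to show the idea):

    os.setfilesystemencoding("utf-8")   # hypothetical system-wide setter
    os.mkdir(u'sp\xe5rv\xe4gen')        # encoded to UTF-8 behind the scenes
    os.listdir(u".")                    # byte names decoded back to unicode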

But as I said, I am naive about unicode, so this post is meant
as a shout for help from those more experienced, to clear up
my own confusion.

					Andrew
					dalke at dalkescientific.com




