LC_ALL and os.listdir()

Wed Feb 23 16:07:19 EST 2005

Kenneth Pronovici wrote:
>    1) Why LC_ALL has any effect on the os.listdir() result? 

The operating system (POSIX) does not have the inherent notion
that file names are character strings. Instead, in POSIX, file
names are primarily byte strings. There are some bytes which
are interpreted as characters (e.g. '\x2e', which is '.',
or '\x2f', which is '/'), but apart from that, most OS layers
think these are just bytes.
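
To see the byte-string view directly, you can hand os.listdir a byte
string and get byte strings back. The following is only a minimal
sketch; raw_names is a made-up helper name, not part of any library.

```python
import os


def raw_names(directory):
    """Return the entries of `directory` as byte strings, sorted.

    `directory` must itself be a byte string; os.listdir then returns
    the names as the bytes stored on disk, with no decoding step.
    """
    if not isinstance(directory, bytes):
        raise TypeError("pass the directory as a byte string")
    return sorted(os.listdir(directory))
```

Because no decoding takes place, the result does not depend on the
locale settings.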

Now, most *people* think that file names are character strings.
To interpret a file name as a character string, you need to know
which encoding maps its bytes to characters.

There is, unfortunately, no operating system API to carry
the notion of a file system encoding. By convention, the locale
settings should be used to establish this encoding, in particular
the LC_CTYPE facet of the locale. This is defined in the
environment variables LC_ALL, LC_CTYPE, and LANG (searched
in this order).
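
The lookup can be sketched in a few lines. POSIX gives LC_ALL priority
over LC_CTYPE, and LANG is consulted last; a variable that is unset or
empty is skipped. The function name effective_ctype is my own, and the
"C" default is what systems typically use when nothing is set.

```python
def effective_ctype(environ):
    """Return the locale name that governs LC_CTYPE for `environ`.

    `environ` is any mapping of environment variables, for example
    os.environ or a plain dict.
    """
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(variable)
        if value:  # unset and empty both mean "keep looking"
            return value
    return "C"  # assumed default when none of the three is set
```

For example, effective_ctype({"LC_ALL": "C", "LANG": "en_US.UTF-8"})
yields "C", which is why setting LC_ALL changes the outcome even when
LANG names a UTF-8 locale.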

>    2) Why only 3 of the 4 files come back as unicode strings?

If none of LC_ALL, LC_CTYPE, and LANG is set, the "C" locale is
assumed, which uses ASCII as its file system encoding. In this
locale, '\xe2\x99\xaa\xe2\x99\xac' is not a valid file name (at
least it cannot be interpreted as characters, and hence cannot
be converted to Unicode).
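
You can reproduce the failure without touching the file system. This
small sketch only decodes the bytes quoted above; the variable names
are mine.

```python
raw = b"\xe2\x99\xaa\xe2\x99\xac"  # the file name bytes from this thread

first_bad_byte = None
try:
    raw.decode("ascii")
    decodable = True
except UnicodeDecodeError as exc:
    # ASCII only covers bytes below 0x80, so decoding stops at once.
    decodable = False
    first_bad_byte = exc.start
```

Here decodable ends up False, and first_bad_byte is 0, because the
very first byte (0xe2) is already outside ASCII.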

Now, your Python script has requested that all file names
*should* be returned as character (i.e. Unicode) strings, but
Python cannot comply, since there is no way to find out what
this byte string means, in terms of characters.

So we have three options:
1. skip this string, only return the ones that can be
    converted to Unicode. Give the user the impression
    the file does not exist.
2. return the string as a byte string
3. refuse to listdir altogether, raising an exception
    (i.e. return nothing)

Python has chosen alternative 2, allowing the application
to implement 1 or 3 on top of that if it wants to (or
come up with other strategies, such as user feedback).
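
As an illustration, options 1 and 3 can be layered over a mixed result
list roughly as follows. This is a sketch under assumptions: the
helper names are invented, and it relies on the name bytes being
available (Python 2.6 or later). Note that a mixed list is what the
Python discussed here returns; Python 3 handles undecodable names
differently, so there the input list would have to be built by hand.

```python
def only_text(names):
    """Option 1: drop every entry that came back as a byte string."""
    return [name for name in names if not isinstance(name, bytes)]


def require_text(names):
    """Option 3: refuse the whole listing if any entry is undecodable."""
    undecodable = [name for name in names if isinstance(name, bytes)]
    if undecodable:
        raise ValueError("undecodable file names: %r" % (undecodable,))
    return list(names)
```

Either helper would be applied to the return value of os.listdir when
it was called with a Unicode path.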

> 3) The proper "general" way to deal with this situation?

You can choose option 1 or 3. You could also tell the user
about the file and then ignore it, or you could try to
guess the encoding (UTF-8 would be a reasonable guess).
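
Guessing could look like the sketch below. The function name and the
default candidate list are assumptions of mine; the idea is simply to
try each encoding in turn and give up if none fits.

```python
def guess_name(raw, candidates=("utf-8",)):
    """Decode `raw` with the first candidate encoding that accepts it.

    Returns the decoded text, or None if every candidate fails.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None
```

UTF-8 makes a good first candidate because arbitrary non-UTF-8 bytes
rarely decode as valid UTF-8 by accident. An encoding such as Latin-1
accepts every byte sequence, so it is only useful as a last resort.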

> My goal is to build generalized code that consistently works with all
> kinds of filenames.

Then it is best to drop the notion that file names are
character strings (because some file names aren't). You
do so by converting your path variable into a byte
string. To do that, you could try

path = path.encode(sys.getfilesystemencoding())

This should work in most cases; Python will try to
determine the file system encoding from the environment,
and encode the path with it. Notice, however:

- on some systems, getfilesystemencoding may return None,
   if the encoding could not be determined. Fall back
   to sys.getdefaultencoding in this case.
- depending on where you got path from, this may
   raise a UnicodeError, if the user has entered a
   path name which cannot be encoded in the file system
   encoding (the user may well believe that she has
   such a file on disk).

So your code would read

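import sys

try:
   # getfilesystemencoding() may return None; fall back to the default.
   path = path.encode(sys.getfilesystemencoding() or
                      sys.getdefaultencoding())
except UnicodeError:
   # The path contains characters the encoding cannot represent.
   print >>sys.stderr, "Invalid path name", repr(path)
   sys.exit(1)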

> Ultimately, all I'm trying to do is copy some files
> around.  I'd really prefer to find a programmatic way to make this work
> that was independent of the user's configured locale, if possible.

As long as you manage to get a byte string from the path
entered, all should be fine.
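
A copy loop that stays in byte strings throughout might look like
this. It is a sketch, not a finished tool: copy_files is a made-up
name, it copies only the regular files directly inside src_dir, and
it assumes both directories already exist.

```python
import os
import shutil


def copy_files(src_dir, dst_dir):
    """Copy the regular files in src_dir to dst_dir.

    Both arguments are byte strings, so os.listdir returns byte
    strings too and no file name is ever decoded.
    """
    copied = []
    for name in sorted(os.listdir(src_dir)):
        source = os.path.join(src_dir, name)
        if os.path.isfile(source):
            shutil.copyfile(source, os.path.join(dst_dir, name))
            copied.append(name)
    return copied
```

Since the names are never converted to characters, the loop behaves
the same under any locale.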

Regards,
Martin