[Python-Dev] Unicode strings as filenames

Skip Montanaro skip@pobox.com (Skip Montanaro)
Thu, 3 Jan 2002 17:11:10 -0600


>>>>> "Martin" == Martin v Loewis <martin@v.loewis.de> writes:

    >> What's the correct way to deal with filenames in a Unicode
    >> environment?  Consider this:
    >>
    >> >>> import site
    >> >>> site.encoding
    >> 'latin-1'

    Martin> Setting site.encoding is certainly the wrong thing to do. How
    Martin> can you know all users of your system use latin-1?

Why is it wrong to set site.encoding to suit your environment at the time
you install Python?  I can't know that all users of my system (whatever
the definition of "my system" is) will use latin-1.  Somewhere along the way
I have to make some assumptions, however.

    On any given computer I assume the people who install Python will set
    site.encoding appropriate to their environment.

    The example I used was latin-1 simply because the folks I'm working with
    are in Austria and they came up with the example.  I assume the best
    default encoding for them is latin-1.

    The application writers themselves will have no problem restricting
    internal filenames to be ASCII.  I assume if users want to save files of
    their own, they will choose characters from the Unicode character set
    they use most frequently.

So, my example used latin-1.  I could just as easily have chosen something
else.

    Martin> On my system, the following works fine

    Martin> >>> import locale ; locale.setlocale(locale.LC_ALL,"")
    Martin> 'LC_CTYPE=de_DE;LC_NUMERIC=de_DE;LC_TIME=de_DE;LC_COLLATE=C;LC_MONETARY=de_DE;LC_MESSAGES=de_DE;LC_PAPER=de_DE;LC_NAME=de_DE;LC_ADDRESS=de_DE;LC_TELEPHONE=de_DE;LC_MEASUREMENT=de_DE;LC_IDENTIFICATION=de_DE'
    Martin> >>> a = "abc\xe4\xfc\xdf.txt"
    Martin> >>> u = unicode(a, "latin-1")
    Martin> >>> open(u, "w")
    Martin> <open file 'abcäüß.txt', mode 'w' at 0x8173e88>

    Martin> On Unix, your best bet for file names is to trust the user's
    Martin> locale settings. If you do that, open will accept Unicode
    Martin> objects.

    Martin> What is your locale?

The above setlocale call prints

    'LC_CTYPE=en_US;LC_NUMERIC=en_US;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en;LC_NAME=en;LC_ADDRESS=en;LC_TELEPHONE=en;LC_MEASUREMENT=en;LC_IDENTIFICATION=en'

I can't get to the machines in Austria right now to see how their locales
are set, though I suspect they haven't fiddled their LC_* environment,
because they are having the problems I described.
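
For what it's worth, the locale-based approach Martin recommends might look
roughly like the sketch below.  It assumes a present-day Python 3
interpreter rather than the 2.2-era one used in this thread, and the helper
names locale_encoding and encode_filename are hypothetical, made up purely
for illustration.

```python
import locale


def locale_encoding():
    """Return the encoding implied by the user's LC_* settings.

    Hypothetical helper: adopts the environment's locale (as
    setlocale(LC_ALL, "") does in the session quoted above) and then
    asks the locale module which encoding that implies.
    """
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale this system does not support;
        # keep whatever locale is already active.
        pass
    # False: do not call setlocale() again inside getpreferredencoding().
    return locale.getpreferredencoding(False)


def encode_filename(name, encoding=None):
    """Encode a text file name to bytes using the locale's encoding.

    Passing an explicit encoding overrides the locale lookup, which is
    mainly useful for experiments.
    """
    if encoding is None:
        encoding = locale_encoding()
    return name.encode(encoding)
```

Under a latin-1 locale, encode_filename("abc\xe4\xfc\xdf.txt") would give the
same bytes as Martin's string a; under a UTF-8 locale the bytes differ,
which is exactly why the choice has to follow the user's settings.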

    >> Is that the correct approach?  Apparently Python's file object
    >> doesn't do this under the covers.  Should it?

    Martin> No. There is no established convention, on Unix, how to do
    Martin> non-ASCII file names. If anything, following the user's locale
    Martin> setting is the most reasonable thing to do; this should be in
    Martin> sync with how the user's terminal displays characters. The
    Martin> Python installation's default encoding is almost useless, and
    Martin> shouldn't be changed.

    Martin> On Windows, things are much better, since there is a notion of
    Martin> Unicode file names in the system.
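
To illustrate the distinction Martin draws between the interpreter's default
encoding and the encoding used for file names, here is a small sketch.  It
again assumes a current Python 3, where sys.getfilesystemencoding(),
os.fsencode() and os.fsdecode() are available; the two function names are
hypothetical.

```python
import os
import sys


def describe_filename_encodings():
    """Report the two encodings that are easy to confuse.

    "default" is the interpreter-wide default for str/bytes conversion;
    "filesystem" is the one applied when a text file name is handed to
    the operating system.
    """
    return {
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
    }


def roundtrip(name):
    """Encode a text file name the way open() would, then decode it back."""
    raw = os.fsencode(name)   # str -> bytes, using the filesystem encoding
    return os.fsdecode(raw)   # bytes -> str, using the same encoding
```

On Unix the "filesystem" value is normally derived from the locale, which
matches the advice above; the "default" value plays no part in file names.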

This suggests to me that the Python docs need some introductory material on
this topic.  It appears to me that the two people in the Python community
who live and breathe this stuff are you, Martin, and Marc-André.  For most
of the rest of us, especially if we've never consciously written code for
consumption outside an ASCII environment, the whole thing just looks like a
quagmire.

Skip