unicode filenames

Paul Boddie paul at boddie.net
Mon Feb 3 06:35:54 EST 2003


Andrew Dalke <adalke at mindspring.com> wrote in message news:<b1kc9o$vf1$1 at slb9.atl.mindspring.net>...
> 
> I normally use unix.  What's the right way to treat filenames
> under that OS?  As Latin-1?  Or UTF-8?  As far as I can tell,
> filenames are simply bytes, so I can make whatever interpretation
> I want on the characters, and the standard viewpoint is to
> interpret those characters as Latin-1.

On Linux, at least, the interpretation may depend on the locale, and
the same is possibly true of other UNIX platforms.
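
As a rough way of seeing which encoding a process would assume, here is
a small sketch for a current Python 3 (an assumption on my part, since
the discussion concerns Python 2.3). The helper name describe_encodings
is purely illustrative; the two calls it wraps are standard library
functions.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python derives from the environment."""
    return {
        # Encoding suggested by the current locale settings.
        "locale": locale.getpreferredencoding(False),
        # Encoding Python uses to convert filenames to and from str.
        "filesystem": sys.getfilesystemencoding(),
    }


print(describe_encodings())
```

The values reported depend entirely on the environment the interpreter
is started in, so no particular output should be expected.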

> [dalke at zebulon src]$ ls sp* | od -c
> 0000000   s   p   å   r   v   ä   g   e   n  \n
> 0000012

I hadn't heard of 'od' before, so this is a useful piece of
information. On Red Hat Linux 7.3 on Intel, with the locale set to
en_US.iso885915, I can apparently create filenames containing
ISO-8859-15 characters. In the terminal program I'm using, these
characters appear as question marks once I switch the locale to
en_US.utf8. In the former locale, 'od -c' prints the characters
themselves as part of the "dump", whereas in the latter it prints the
octal codes for those characters.
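
To illustrate why the same name can look different under the two
locales, here is a minimal sketch in a current Python 3 (again an
assumption, not the Python 2.3 under discussion). It uses the filename
from the quoted 'ls' output and only standard codecs; the variable
names are mine.

```python
# Filename taken from the quoted 'ls' output ("spårvägen"),
# written with escapes so the source file itself stays ASCII-safe.
name = "sp\u00e5rv\u00e4gen"

latin9_bytes = name.encode("iso-8859-15")
utf8_bytes = name.encode("utf-8")

print(latin9_bytes)  # b'sp\xe5rv\xe4gen'
print(utf8_bytes)    # b'sp\xc3\xa5rv\xc3\xa4gen'

# Reading the ISO-8859-15 bytes as if they were UTF-8 cannot succeed
# for the two non-ASCII bytes; errors="replace" substitutes U+FFFD,
# roughly what a terminal does when it shows question marks.
misread = latin9_bytes.decode("utf-8", errors="replace")
print(ascii(misread))
```

The ISO-8859-15 form is nine bytes long, which matches the ten bytes
(including the newline) in the 'od' dump above.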

What is interesting is that if I try to remove the file in UTF-8 mode,
it succeeds, even though the byte encoding of the filename should
really be different from what it was before. Moreover, if I create a
file with ISO-8859-15-encodable characters in UTF-8 mode, it seems to
use the ISO-8859-15 byte values.
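
One possible reading of the removal succeeding is that the filesystem
only ever sees the bytes of the name, whatever the locale says. The
following is a sketch under that assumption, for a current Python 3 on
a Linux-like filesystem that stores names as raw bytes; other
platforms may reject such names. The function name
roundtrip_raw_filename is hypothetical.

```python
import os
import tempfile


def roundtrip_raw_filename(raw_name):
    """Create, list and remove a file whose name is given as raw bytes.

    Returns the directory listing (as bytes) taken while the file
    existed, and the listing taken after it was removed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)
        path = os.path.join(tmp_bytes, raw_name)
        # Passing bytes paths avoids any locale-based decoding.
        with open(path, "wb"):
            pass
        while_present = os.listdir(tmp_bytes)
        os.remove(path)
        after_removal = os.listdir(tmp_bytes)
    return while_present, after_removal


# ISO-8859-15 bytes for the filename in the quoted 'ls' output.
print(roundtrip_raw_filename(b"sp\xe5rv\xe4gen"))
```

Because every call receives the same bytes, the locale in effect when
the file is removed plays no part in whether the removal succeeds.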

Perhaps the "UTF-8 and Unicode FAQ" and the manual might be of help:

  man unicode

Still, I see your point about it being harder to use non-ASCII
characters in filenames on UNIX with the upcoming Python 2.3. In many
environments, this is a highly unsatisfactory situation.

Paul



