File names, character sets and Unicode

Marc 'BlackJack' Rintsch bj_666 at gmx.net
Fri Dec 12 06:08:09 EST 2008


On Fri, 12 Dec 2008 23:32:27 +1300, Michal Ludvig wrote:

> is there any way to determine what's the charset of filenames returned
> by os.walk()?

No.  Especially on *nix file systems, file names are just strings of 
bytes, not characters.  It is possible to have file names in different 
encodings in the same directory.

> The trouble is, if I pass <type 'str'> argument to os.walk() I get the
> filenames as byte-strings. Possibly UTF-8 encoded Unicode, who knows.

Nobody knows.  :-)

> What's the right and safe way to walk the filesystem and get some
> meaningful filenames?

The safe way is to use `str`.
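
A rough sketch of walking with byte strings, assuming Python 3, where the byte string type is called `bytes` rather than `str`. The helper name `walk_bytes` is invented for this example; `os.fsencode` and `os.walk` are standard library calls.

```python
import os


def walk_bytes(top):
    """Yield the path of every file under `top` as a byte string."""
    # os.fsencode accepts str or bytes and always returns bytes.
    top = os.fsencode(top)
    # Given a bytes argument, os.walk yields bytes for directories and files.
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            yield os.path.join(dirpath, name)
```

The yielded paths can be handed straight back to `open()` or `os.stat()` without any decoding step.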

> Related question - if the directory name is given on a command line
> what's the right way to preprocess the argument before passing it down
> to os.walk()?

Pass it as is.

> For instance with LANG=en_NZ.UTF-8 (i.e. UTF-8 system):
> * directory is called 'smile☺'
> * sys.argv[1] will be 'smile\xe2\x98\xba' (type str)
> * after .decode("utf-8") I get u'smile\u263a' (type unicode)
> 
> But how should I decode() it when running on a system where $LANG
> doesn't end with "UTF-8"? Apparently some locales have non-ascii default
> charsets. For instance zh_TW is BIG5 charset by default, ru_RU is
> ISO-8859-5, etc. How do I detect that to get the right charset for
> decode()?

You can't.  Even if you know the preferred encoding of the system, e.g. 
via $LANG, there is no guarantee that all file names are encoded this way.

> I tend to have everything internally in Unicode but it's often unclear
> how to convert some inputs to Unicode in the first place. What are the
> best practices for dealing with these charset issues in Python?

I usually use UTF-8 as the default but offer the user ways, e.g. command 
line switches, to change that.
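
One possible shape for such a switch, sketched with the standard `argparse` module. The option name `--encoding`, the positional `directory` and the function `parse_args` are illustrative choices, not anything prescribed above.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; `--encoding` is a hypothetical option name."""
    parser = argparse.ArgumentParser(description="Walk a directory tree.")
    parser.add_argument("directory", help="directory to walk")
    parser.add_argument(
        "--encoding",
        default="utf-8",  # the default guess, overridable by the user
        help="encoding assumed when decoding file names (default: utf-8)",
    )
    return parser.parse_args(argv)
```

A user on a BIG5 system could then run the program with `--encoding big5`, while everyone else gets UTF-8 without typing anything.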

If I have to display file names in a GUI I use a decoded version of the 
byte string file name, but keep the byte string for operations on the 
file.
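
A small sketch of that split, again assuming Python 3 `bytes`. The function name `display_name` is hypothetical, and using `errors="replace"` for undecodable bytes is an assumption of this example rather than part of the advice above.

```python
def display_name(raw_name, encoding="utf-8"):
    """Return a printable version of a byte string file name.

    Bytes that are invalid in `encoding` become U+FFFD instead of raising,
    so the result is for display only and cannot be turned back into the
    original name.
    """
    return raw_name.decode(encoding, errors="replace")


raw = b"smile\xe2\x98\xba"   # keep this for open(), os.rename(), ...
label = display_name(raw)     # show this in the GUI
```

The decoded label may be lossy, which is exactly why the original bytes stay around for every operation on the file.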

Ciao,
	Marc 'BlackJack' Rintsch


