File names, character sets and Unicode
Marc 'BlackJack' Rintsch
bj_666 at gmx.net
Fri Dec 12 06:08:09 EST 2008
On Fri, 12 Dec 2008 23:32:27 +1300, Michal Ludvig wrote:
> is there any way to determine what's the charset of filenames returned
> by os.walk()?
No. Especially on *nix file systems, file names are just strings of
bytes, not characters. It is possible to have file names in different
encodings in the same directory.
> The trouble is, if I pass <type 'str'> argument to os.walk() I get the
> filenames as byte-strings. Possibly UTF-8 encoded Unicode, who knows.
Nobody knows. :-)
> What's the right and safe way to walk the filesystem and get some
> meaningful filenames?
The safe way is to pass a byte string (`str`) to os.walk() and work with
the byte-string names you get back.
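A minimal sketch of that approach, with a hypothetical helper name `walk_bytes`. It is written so that it also runs on Python 3, where the byte string type is spelled `bytes` (on Python 2.6+ `bytes` is just another name for `str`):

```python
import os


def walk_bytes(top):
    """Yield the full path of every file below `top` as a byte string.

    `top` must itself be a byte string; os.walk() then returns
    byte-string names, so no name ever has to be decoded.
    """
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            # Joining bytes with bytes keeps the exact on-disk name.
            yield os.path.join(dirpath, name)
```

Every path produced this way can be handed straight back to open() or os.stat(), whatever encoding the name was written in.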
> Related question - if the directory is given name on a command line
> what's the right way to preprocess the argument before passing it down
> to os.walk()?
Pass it as is.
> For instance with LANG=en_NZ.UTF-8 (i.e. UTF-8 system):
> * directory is called 'smile☺'
> * sys.argv[1] will be 'smile\xe2\x98\xba' (type str)
> * after .decode("utf-8") I get u'smile\u263a' (type unicode)
>
> But how should I decode() it when running on a system where $LANG
> doesn't end with "UTF-8"? Apparently some locales have non-ascii default
> charsets. For instance zh_TW is BIG5 charset by default, ru_RU is
> ISO-8859-5, etc. How do I detect that to get the right charset for
> decode()?
You can't. Even if you know the preferred encoding of the system, e.g.
via $LANG, there is no guarantee that all file names are encoded this way.
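To illustrate why the locale is only a guess: locale.getpreferredencoding() reports what the system prefers, but decoding a given name with it can still fail. The helper name `guess_decode` and its `encoding` parameter are made up for this sketch:

```python
import locale


def guess_decode(raw_name, encoding=None):
    """Try to decode a byte-string file name; return None if it does not fit.

    With no explicit encoding, the locale's preferred one is used as a guess.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding()
    try:
        return raw_name.decode(encoding)
    except UnicodeDecodeError:
        # The name was written under some other encoding. The bytes are
        # still a valid file name and must not be thrown away.
        return None
```

Note that success proves little: a single-byte encoding such as ISO-8859-5 will decode almost any byte string without error, just not necessarily into the characters the file's creator intended.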
> I tend to have everything internally in Unicode but it's often unclear
> how to convert some inputs to Unicode in the first place. What are the
> best practices for dealing with these charset issues in Python?
I usually default to UTF-8 but give the user ways, e.g. command line
switches, to change that.
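Such a switch could look roughly like this. The option name `--encoding` and the function `parse_args` are only illustrative, and the sketch assumes a Python that ships argparse (2.7 or later):

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; the encoding defaults to UTF-8."""
    parser = argparse.ArgumentParser(description="Walk a directory tree.")
    parser.add_argument("directory", help="directory to walk")
    parser.add_argument(
        "--encoding",
        default="utf-8",
        help="encoding assumed for file names when displaying them "
             "(default: %(default)s)",
    )
    return parser.parse_args(argv)
```

A user on a BIG5 system would then run the script with `--encoding big5`, while everyone else gets the UTF-8 default.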
If I have to display file names in a GUI I use a decoded version of the
byte string file name, but keep the byte string for operations on the
file.
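As a rough sketch of that split, with made-up helper names `display_name` and `labelled`: the decoded text is used only for showing, and bytes that do not fit the chosen encoding are replaced with U+FFFD instead of raising an error.

```python
def display_name(raw_name, encoding="utf-8"):
    """Return text suitable for a GUI label; undecodable bytes become U+FFFD."""
    return raw_name.decode(encoding, "replace")


def labelled(raw_names, encoding="utf-8"):
    """Pair each byte-string name with the text to show for it."""
    # The raw bytes remain the value used for open(), os.stat(), os.remove();
    # the decoded text is for display only and may be lossy.
    return [(raw, display_name(raw, encoding)) for raw in raw_names]
```

The display text must never be encoded back to get a file name, because the replacement step cannot be undone.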
Ciao,
Marc 'BlackJack' Rintsch