File names, character sets and Unicode

Fri Dec 12 08:21:16 EST 2008

Michal Ludvig wrote:
> Hi all,
> 
> is there any way to determine what's the charset of filenames returned
> by os.walk()?
> 
> The trouble is, if I pass <type 'str'> argument to os.walk() I get the
> filenames as byte-strings. Possibly UTF-8 encoded Unicode, who knows.
> 
> OTOH If I pass <type 'unicode'> to os.walk() all the filenames I get in
> the loop are already unicode()d.
> 
> However, with some locale settings os.walk() dies with, for example:
> Traceback (most recent call last):
>   File "tst.py", line 10, in <module>
>     for root, dirs, files in filelist:
>   File "/usr/lib/python2.5/os.py", line 303, in walk
>     for x in walk(path, topdown, onerror):
>   File "/usr/lib/python2.5/os.py", line 293, in walk
>     if isdir(join(top, name)):
>   File "/usr/lib/python2.5/posixpath.py", line 65, in join
>     path += '/' + b
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1:
> ordinal not in range(128)
> 
> I can't even skip over these files with 'os.walk(..., onerror=handler)';
> the handler() is never called.
> 
> That happens for instance when the file names contain non-ASCII
> characters and the locale is set to ASCII, but reportedly in some other
> cases as well.
> 
> What's the right and safe way to walk the filesystem and get some
> meaningful filenames?
> 
> 
> Related question - if the directory name is given on the command line,
> what's the right way to preprocess the argument before passing it down
> to os.walk()?
> 
> For instance with LANG=en_NZ.UTF-8 (i.e. UTF-8 system):
> * directory is called 'smile☺'
> * sys.argv[1] will be 'smile\xe2\x98\xba' (type str)
> * after .decode("utf-8") I get u'smile\u263a' (type unicode)
> 
> But how should I decode() it when running on a system where $LANG
> doesn't end with "UTF-8"? Apparently some locales have non-ASCII default
> charsets. For instance zh_TW is BIG5 charset by default, ru_RU is
> ISO-8859-5, etc. How do I detect that to get the right charset for decode()?
> 
> I tend to have everything internally in Unicode but it's often unclear
> how to convert some inputs to Unicode in the first place. What are the
> best practices for dealing with these charset issues in Python?
> 
There's currently a huge thread on python-dev dealing with (or rather
discussing) this very tortuous issue. Look for "Python-3.0, unicode, and
os.environ" in the archives. (The same issue, by the way, also applies
to environment variables.)
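
For the walking part, one workaround is to pass a byte string to
os.walk(), so the names come back exactly as stored, and to decode them
only for display. What follows is a minimal sketch, not something taken
from that thread: walk_decoded() is a made-up helper, the UTF-8 fallback
is my assumption, and the b"" literals in the usage need Python 2.6 or
later.

```python
import os
import sys


def walk_decoded(top, encoding=None):
    """Yield (raw_path, display_path) for every file below `top`.

    `top` must be a byte string, so os.walk() never tries to decode names.
    """
    # getfilesystemencoding() may return None on some systems; UTF-8 is
    # only an assumed fallback here.
    encoding = encoding or sys.getfilesystemencoding() or "utf-8"
    for root, dirs, files in os.walk(top):
        for name in files:
            raw = os.path.join(root, name)
            # "replace" keeps the walk going when a name is not valid in
            # `encoding`; use raw (not the decoded text) to open the file.
            yield raw, raw.decode(encoding, "replace")
```

Something like "for raw, shown in walk_decoded(b'.')" then gives you the
untouched bytes for open() and a Unicode string that is safe to print or
log, even when the decoded form is lossy.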

In a nutshell, this is likely to cause pain until all file systems are
standardized on a particular encoding of Unicode. Probably only about
another fifteen years to go ...
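
As for decoding sys.argv[1], the locale module can report the charset
that the current locale implies. Again this is only a sketch under
assumptions: guess_encoding() and to_text() are hypothetical names, the
"ascii" last resort is my choice, and the result is only as good as the
user's locale settings. It says nothing about which encoding the file
names on disk actually use.

```python
import locale
import sys


def guess_encoding():
    """Best guess at the charset implied by the user's locale."""
    # getpreferredencoding() consults the locale settings (LANG, LC_ALL,
    # LC_CTYPE); the later fallbacks are assumptions, not a standard.
    return (locale.getpreferredencoding()
            or sys.getfilesystemencoding()
            or "ascii")


def to_text(arg, encoding=None):
    """Decode a byte-string argument; pass Unicode text through unchanged."""
    if isinstance(arg, bytes):
        # "replace" avoids a UnicodeDecodeError on bytes that do not fit
        # the guessed encoding, at the cost of a lossy result.
        return arg.decode(encoding or guess_encoding(), "replace")
    return arg
```

With an explicit encoding, to_text(b'smile\xe2\x98\xba', 'utf-8') gives
u'smile\u263a', matching the example in the question. Without one, the
locale decides.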

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/