File names, character sets and Unicode
Steve Holden
steve at holdenweb.com
Fri Dec 12 08:21:16 EST 2008
Michal Ludvig wrote:
> Hi all,
>
> is there any way to determine the charset of the filenames returned
> by os.walk()?
>
> The trouble is, if I pass a <type 'str'> argument to os.walk() I get the
> filenames as byte-strings. Possibly UTF-8 encoded Unicode, who knows.
>
> OTOH If I pass <type 'unicode'> to os.walk() all the filenames I get in
> the loop are already unicode()d.
>
> However, with some locale settings os.walk() dies with, for example:
> Traceback (most recent call last):
>   File "tst.py", line 10, in <module>
>     for root, dirs, files in filelist:
>   File "/usr/lib/python2.5/os.py", line 303, in walk
>     for x in walk(path, topdown, onerror):
>   File "/usr/lib/python2.5/os.py", line 293, in walk
>     if isdir(join(top, name)):
>   File "/usr/lib/python2.5/posixpath.py", line 65, in join
>     path += '/' + b
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
>
> I can't even skip over these files with 'os.walk(..., onerror=handler)';
> the handler() is never called.
>
> That happens, for instance, when the file names contain non-ASCII
> characters and the locale is set to ASCII, but reportedly in some other
> cases as well.
>
> What's the right and safe way to walk the filesystem and get some
> meaningful filenames?
>
>
> Related question - if the directory name is given on the command line,
> what's the right way to preprocess the argument before passing it down
> to os.walk()?
>
> For instance with LANG=en_NZ.UTF-8 (i.e. UTF-8 system):
> * directory is called 'smile☺'
> * sys.argv[1] will be 'smile\xe2\x98\xba' (type str)
> * after .decode("utf-8") I get u'smile\u263a' (type unicode)
>
> But how should I decode() it when running on a system where $LANG
> doesn't end with "UTF-8"? Apparently some locales have non-ascii default
> charsets. For instance zh_TW is BIG5 charset by default, ru_RU is
> ISO-8859-5, etc. How do I detect that to get the right charset for decode()?
>
> I tend to have everything internally in Unicode but it's often unclear
> how to convert some inputs to Unicode in the first place. What are the
> best practices for dealing with these charset issues in Python?
>
There's currently a huge thread on python-dev dealing with (or rather
discussing) this very tortuous issue. Look for "Python-3.0, unicode, and
os.environ" in the archives. (The same issue, by the way, also applies
to environment variables.)
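For the command-line half of the question, here is a minimal sketch. It
assumes the argument is available as a byte string (as sys.argv[1] is on
Python 2; on Python 3, os.fsencode(sys.argv[1]) should recover the original
bytes on POSIX). The helper name decode_argument is made up for illustration,
and locale.getpreferredencoding() only reports what the locale claims, which
need not match how the file name was actually written:

```python
import locale


def decode_argument(raw, encoding=None):
    """Decode a byte-string argument to text (hypothetical helper).

    If no encoding is given, fall back to the charset implied by the
    current locale (LANG / LC_ALL / LC_CTYPE). That is only a guess.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding()
    # "replace" turns undecodable bytes into U+FFFD instead of raising,
    # so a wrong guess degrades the name rather than killing the program.
    return raw.decode(encoding, "replace")
```

With a UTF-8 locale, decode_argument('smile\xe2\x98\xba') gives u'smile\u263a',
matching the manual .decode("utf-8") in the question. The result is suitable
for display; the original bytes are still the safer thing to hand to os.walk().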
In a nutshell, this is likely to cause pain until all file systems are
standardized on a particular encoding of Unicode. Probably only about
another fifteen years to go ...
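In the meantime, one way to sidestep the UnicodeDecodeError is sketched
below. It assumes os.walk() is handed a byte-string path, so that no implicit
decoding happens inside os.path.join(): keep the raw bytes for file
operations and decode a copy only for display. walk_decoded is a hypothetical
helper, not part of the os module, and the fallback to UTF-8 is a guess:

```python
import os
import sys


def walk_decoded(top, encoding=None):
    """Yield (raw_path, display_name) for every file below `top`.

    raw_path is a byte string to pass back to open() or os.stat();
    display_name is text intended only for printing or logging.
    """
    if encoding is None:
        # May be wrong for names created under a different locale.
        encoding = sys.getfilesystemencoding() or "utf-8"
    if not isinstance(top, bytes):
        top = top.encode(encoding)
    # A bytes argument makes os.walk() return bytes names throughout.
    for root, dirs, files in os.walk(top):
        for name in files:
            raw_path = os.path.join(root, name)
            # Undecodable bytes become U+FFFD rather than raising.
            yield raw_path, raw_path.decode(encoding, "replace")
```

The display name may be lossy, so it should not be used to reopen the file;
that is what the raw path is for.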
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/