Newbie question about text encoding

Chris Angelico rosuav at gmail.com
Sat Mar 7 13:13:09 EST 2015


On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers <dan at tombstonezero.net> wrote:
> On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:
>
>> On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>
>>> Correct. Linux pathnames are octet strings regardless of the locale.
>>>
>>> That's why Linux developers should refer to filenames using bytes.
>>> Unfortunately, Python itself violates that principle by having
>>> os.listdir() return str objects (to mention one example).
>>
>> Only because you gave it a str with the path name. If you want to
>> refer to file names using bytes, then be consistent and refer to ALL
>> file names using bytes. As I demonstrated, that works just fine.
>
> Python 3.4.2 (default, Oct  8 2014, 10:45:20)
> [GCC 4.9.1] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> type(os.listdir(os.curdir)[0])
> <class 'str'>

Help on module os:

DESCRIPTION
    This exports:
      - os.curdir is a string representing the current directory ('.' or ':')
      - os.pardir is a string representing the parent directory ('..' or '::')

os.curdir is explicitly documented as a string, so passing it to
os.listdir gets you strings back. If you want to work with strings,
work with strings. If you want to work with bytes, don't use
os.curdir; pass bytes instead. Personally, I'm happy using strings, but
if you want to go down the path of using bytes, you simply have to be
consistent, and that probably means being platform-dependent anyway,
so just use b"." for the current directory.
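
To make the "str in, str out; bytes in, bytes out" behaviour concrete,
here is a minimal sketch. The helper names (listdir_types, demo) and
the scratch file example.txt are made up for illustration. The only
real APIs relied on are os.listdir, which returns names of the same
type as the path it was given, and os.fsencode, used here just to get
a bytes version of the temporary directory's path.

```python
import os
import tempfile


def listdir_types(path):
    """Return the set of types of the names os.listdir yields for path."""
    return {type(name) for name in os.listdir(path)}


def demo():
    """List one scratch directory twice: once via str, once via bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file, created only so the directory is not empty.
        open(os.path.join(tmp, "example.txt"), "w").close()
        str_types = listdir_types(tmp)                 # str path
        bytes_types = listdir_types(os.fsencode(tmp))  # bytes path
    return str_types, bytes_types


if __name__ == "__main__":
    str_types, bytes_types = demo()
    print("str path   ->", str_types)
    print("bytes path ->", bytes_types)
```

The same holds for b".": os.listdir(b".") hands back bytes names, so
nothing gets decoded unless you ask for it.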

Normally, using Unicode strings for file names will work just fine.
Any name that you craft yourself will be correctly encoded for the
target file system (or UTF-8 if you can't know), and any that you get
back from os.listdir or equivalent will be usable in file name
contexts. What else can you do with a file name that isn't encoded the
way you expect it to be? Unless you have some out-of-band encoding
information, you can't do anything meaningful with the stream of
bytes, other than keeping it exactly as it is.
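
As a rough illustration of "keeping it exactly as it is": on POSIX
systems, Python decodes file names with the surrogateescape error
handler (PEP 383), so bytes that don't fit the expected encoding
survive a trip through str unchanged. The sketch below assumes a UTF-8
file system encoding and hard-codes the codec so it behaves the same
everywhere; in real code os.fsdecode and os.fsencode pick the encoding
and error handler for you. The helper names and the sample file name
are hypothetical.

```python
def decode_name(raw: bytes) -> str:
    """Decode a file name, smuggling undecodable bytes through as lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def encode_name(name: str) -> bytes:
    """Reverse decode_name, restoring the original bytes exactly."""
    return name.encode("utf-8", "surrogateescape")


def strictly_encodable(name: str) -> bool:
    """Report whether name is clean text that plain UTF-8 can encode."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    raw = b"caf\xe9.txt"      # Latin-1 bytes; not valid UTF-8
    name = decode_name(raw)   # still usable as a file name
    print(ascii(name))
    print(encode_name(name) == raw)
    print(strictly_encodable(name))
```

The decoded name is fine to pass back to open() or os.rename(), but it
is not something you can safely print or store as text, which is the
point: without knowing the real encoding, round-tripping is about all
you can do with it.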

ChrisA


