Newbie question about text encoding

Sat Mar 7 12:50:20 EST 2015

Chris Angelico <rosuav at gmail.com>:

> On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> File names encoded with Latin-X are quite commonplace even in UTF-8
>> locales.
>
> That is not a problem with UTF-8, though. I don't understand how
> you're blaming UTF-8 for that.

I'm saying it creates practical problems. There's a snake in paradise.

> There are two things happening here:
>
> 1) The underlying file system is not UTF-8, and you can't depend on
> that,

Correct. Linux pathnames are octet strings regardless of the locale.

That's why Linux developers should refer to filenames using bytes.
Unfortunately, Python itself violates that principle by having
os.listdir() return str objects unless it is passed a bytes path (to
mention one example).
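
To illustrate the bytes approach, here is a minimal sketch. It relies on
os.listdir() returning bytes entries when it is given a bytes path. The
helper name list_names_as_bytes and the file name demo.txt are made up
for this example; they are not from the thread.

```python
import os
import tempfile


def list_names_as_bytes(directory):
    """Return the directory entries as raw bytes, with no decoding step."""
    # os.listdir() yields bytes entries when it receives a bytes path.
    return os.listdir(os.fsencode(directory))


with tempfile.TemporaryDirectory() as tmp:
    # A plain ASCII name keeps the demo portable across file systems.
    open(os.path.join(tmp, "demo.txt"), "w").close()
    names = list_names_as_bytes(tmp)
    print(names)
```

On Linux the same call would hand back a Latin-X name as the exact
octets stored in the directory, so nothing has to be decoded first.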

> 2) You forgot to put the path on that, so it failed to find the file.
> Here's my version of your demo:
>
> >>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
> <_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>
>
> Looks fine to me.

I stand corrected.

Then we have:

   >>> os.listdir()[0].encode('utf-8')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed
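
For comparison, a hedged sketch of how the original octet can still be
recovered from such a str. It assumes the file name consists of the
single byte 0x80, which Python presents as '\udc80' (PEP 383), and it
uses the standard 'surrogateescape' error handler. The variable names
are illustrative only.

```python
# Assumption: the str form of a file name whose only byte is 0x80.
name = '\udc80'

# Strict UTF-8 refuses lone surrogates, as in the traceback above.
try:
    name.encode('utf-8')
except UnicodeEncodeError as exc:
    print('strict encode failed:', exc.reason)

# The surrogateescape handler maps U+DC80 back to the byte 0x80.
raw = name.encode('utf-8', 'surrogateescape')
print(raw)

# Decoding with the same handler restores the original str.
restored = raw.decode('utf-8', 'surrogateescape')
print(restored == name)
```

On POSIX systems os.fsencode() and os.fsdecode() apply this handler
together with the file system encoding, so they are usually the more
convenient way to do the round trip.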

Marko