Newbie question about text encoding

Sat Mar 7 12:59:56 EST 2015

On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> There are two things happening here:
>>
>> 1) The underlying file system is not UTF-8, and you can't depend on
>> that,
>
> Correct. Linux pathnames are octet strings regardless of the locale.
>
> That's why Linux developers should refer to filenames using bytes.
> Unfortunately, Python itself violates that principle by having
> os.listdir() return str objects (to mention one example).

Only because you gave it a str path name: os.listdir() returns str
when passed str, and bytes when passed bytes. If you want to refer to
file names using bytes, then be consistent and refer to ALL file names
using bytes. As I demonstrated, that works just fine.
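
For anyone following along, here is a minimal sketch of that behaviour, assuming a Linux system where any byte sequence is a legal file name. The scratch directory and the one-byte file name b"\x80" are made up for the illustration; they are not the /tmp/xyz from the earlier demo.

```python
import os
import tempfile

# Hypothetical scratch directory, created only for this illustration.
with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)

    # Create a file whose name is the single byte 0x80 (not valid UTF-8).
    with open(os.path.join(tmp_bytes, b"\x80"), "wb"):
        pass

    str_names = os.listdir(tmp)          # str path in, str names out
    bytes_names = os.listdir(tmp_bytes)  # bytes path in, bytes names out

    print(str_names)
    print(bytes_names)

    # A bytes name can be joined and opened without any decoding step.
    with open(os.path.join(tmp_bytes, bytes_names[0]), "rb") as f:
        data = f.read()
```

The point is only that the return type follows the argument type, so code that stays in bytes from end to end never has to decode a file name.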

>> 2) You forgot to put the path on that, so it failed to find the file.
>> Here's my version of your demo:
>>
>> >>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
>> <_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>
>>
>> Looks fine to me.
>
> I stand corrected.
>
> Then we have:
>
>    >>> os.listdir()[0].encode('utf-8')
>    Traceback (most recent call last):
>      File "<stdin>", line 1, in <module>
>    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
>    position 0: surrogates not allowed

So?
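
The '\udc80' in that traceback is a lone surrogate standing in for the undecodable byte 0x80, which is the PEP 383 "surrogateescape" scheme. The strict UTF-8 codec refuses it, but the same error handler that produced it will turn it back into the original byte. A rough sketch, with the one-character name hard-coded as an assumption rather than read from a real directory:

```python
import os

# Assumed example: how the undecodable byte 0x80 shows up in a str file name.
name = "\udc80"

# Strict UTF-8 rejects lone surrogates, as in the quoted traceback.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The surrogateescape handler maps the surrogate back to the original byte.
raw = name.encode("utf-8", "surrogateescape")
round_trip = raw.decode("utf-8", "surrogateescape")

# os.fsencode() applies the filesystem encoding and its error handler;
# on a typical Linux setup that is UTF-8 with surrogateescape.
fs_raw = os.fsencode(name)

print(strict_failed, raw, round_trip == name)
```

So the str name is perfectly usable for opening the file; it just is not something to push through a strict codec.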

ChrisA