Newbie question about text encoding
Albert-Jan Roskam
fomcl at yahoo.com
Sat Mar 7 14:03:34 EST 2015
----- Original Message -----
> From: Chris Angelico <rosuav at gmail.com>
> To:
> Cc: "python-list at python.org" <python-list at python.org>
> Sent: Saturday, March 7, 2015 6:26 PM
> Subject: Re: Newbie question about text encoding
>
> On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> See:
>>
>> $ mkdir /tmp/xyz
>> $ touch /tmp/xyz/$'\x80'
>> $ python3
>> Python 3.3.2 (default, Dec 4 2014, 12:49:00)
>> [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import os
>> >>> os.listdir('/tmp/xyz')
>> ['\udc80']
>> >>> open(os.listdir('/tmp/xyz')[0])
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> FileNotFoundError: [Errno 2] No such file or directory: '\udc80'
>>
>> File names encoded with Latin-X are quite commonplace even in UTF-8
>> locales.
>
> That is not a problem with UTF-8, though. I don't understand how
> you're blaming UTF-8 for that. There are two things happening here:
>
> 1) The underlying file system is not UTF-8, and you can't depend on
> that, ergo the decode to Unicode has to have some special handling of
> failing bytes.
> 2) You forgot to put the path on that, so it failed to find the file.
> Here's my version of your demo:
>
> >>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
> <_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>
>
> Looks fine to me.
>
> Alternatively, if you pass a byte string to os.listdir, you get back a
> list of byte string file names:
>
> >>> os.listdir(b"/tmp/xyz")
> [b'\x80']
Nice, I did not know that. glob.glob works the same way: given a str pattern it returns a list of str paths, and given a bytes pattern it returns a list of bytes paths.
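
For what it's worth, here is a small sketch of both points from this thread. It assumes the '\udc80' in the listing comes from the surrogateescape error handler, which as far as I know is what Python 3 uses to decode file names on POSIX. The first part uses the codec directly so it does not depend on the file system accepting the byte 0x80. The second part uses a scratch directory and a made-up file name (demo.txt), not the /tmp/xyz from the example above.

```python
import glob
import os
import tempfile

# Part 1: the surrogateescape round trip behind '\udc80'.
raw = b"\x80"                                  # not valid UTF-8 on its own
name = raw.decode("utf-8", "surrogateescape")  # undecodable byte becomes a lone surrogate
restored = name.encode("utf-8", "surrogateescape")  # and maps back to the same byte

# Part 2: os.listdir and glob.glob return the type they were given.
# The directory and the file name "demo.txt" are hypothetical, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "demo.txt"), "w").close()
    btmp = os.fsencode(tmp)                    # same directory, as bytes

    str_names = os.listdir(tmp)                # str in, list of str out
    bytes_names = os.listdir(btmp)             # bytes in, list of bytes out
    str_globbed = glob.glob(os.path.join(tmp, "*"))
    bytes_globbed = glob.glob(os.path.join(btmp, b"*"))

print(repr(name), restored)
print(str_names, bytes_names)
```

So if a directory may contain names that are not valid in the locale encoding, working in bytes throughout looks like the simpler option, while the str names still round-trip as long as they are only passed back to the os functions.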