Newbie question about text encoding

Albert-Jan Roskam fomcl at yahoo.com
Sat Mar 7 14:03:34 EST 2015



----- Original Message -----

> From: Chris Angelico <rosuav at gmail.com>
> To: 
> Cc: "python-list at python.org" <python-list at python.org>
> Sent: Saturday, March 7, 2015 6:26 PM
> Subject: Re: Newbie question about text encoding
> 
> On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>>  See:
>> 
>>     $ mkdir /tmp/xyz
>>     $ touch /tmp/xyz/$'\x80'
>>     $ python3
>>     Python 3.3.2 (default, Dec  4 2014, 12:49:00)
>>     [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
>>     Type "help", "copyright", "credits" or "license" for more information.
>>     >>> import os
>>     >>> os.listdir('/tmp/xyz')
>>     ['\udc80']
>>     >>> open(os.listdir('/tmp/xyz')[0])
>>     Traceback (most recent call last):
>>       File "<stdin>", line 1, in <module>
>>     FileNotFoundError: [Errno 2] No such file or directory: '\udc80'
>> 
>>  File names encoded with Latin-X are quite commonplace even in UTF-8
>>  locales.
> 
> That is not a problem with UTF-8, though. I don't understand how
> you're blaming UTF-8 for that. There are two things happening here:
> 
> 1) The underlying file system is not UTF-8, and you can't depend on
> that, ergo the decode to Unicode has to have some special handling of
> failing bytes.
> 2) You forgot to put the path on that, so it failed to find the file.
> Here's my version of your demo:
> 
> >>> open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0])
> <_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'>
> 
> Looks fine to me.
> 
> Alternatively, if you pass a byte string to os.listdir, you get back a
> list of byte string file names:
> 
> >>> os.listdir(b"/tmp/xyz")
> [b'\x80']

Nice, I did not know that. And glob.glob works the same way: given a str pattern it returns a list of str paths, and given a bytes pattern it returns a list of bytes paths.
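For what it's worth, here is a minimal sketch of that symmetry. The helper names (list_both_ways, glob_both_ways, roundtrip) are made up for illustration, and the demo uses a throwaway temp directory with a plain ASCII file name so that it runs on any platform. The last helper assumes the '\udc80' in the listing above comes from decoding with the "surrogateescape" error handler, and shows that the escaped str encodes back to the original byte.

```python
import glob
import os
import tempfile


def list_both_ways(directory):
    """Return (str names, bytes names) for the same directory."""
    str_names = sorted(os.listdir(directory))
    # os.fsencode turns the str path into bytes using the filesystem encoding.
    bytes_names = sorted(os.listdir(os.fsencode(directory)))
    return str_names, bytes_names


def glob_both_ways(directory):
    """Return (str paths, bytes paths) matched by the same '*' pattern."""
    # glob.escape guards against magic characters in the directory name.
    str_pattern = os.path.join(glob.escape(directory), "*")
    bytes_pattern = os.fsencode(str_pattern)
    return sorted(glob.glob(str_pattern)), sorted(glob.glob(bytes_pattern))


def roundtrip(raw):
    """Decode bytes with surrogateescape, then encode them back."""
    text = raw.decode("utf-8", "surrogateescape")
    return text, text.encode("utf-8", "surrogateescape")


with tempfile.TemporaryDirectory() as demo_dir:
    # Create one empty file to list.
    open(os.path.join(demo_dir, "plain.txt"), "w").close()
    str_names, bytes_names = list_both_ways(demo_dir)
    str_paths, bytes_paths = glob_both_ways(demo_dir)

print(str_names, bytes_names)
print(roundtrip(b"\x80"))
```

So the type of the argument decides the type of the result in both functions, and a name that is not valid UTF-8 survives the trip through str as long as it is encoded back with the same error handler.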


