os.walk the apostrophe and unicode

Michael Torrie torriem at gmail.com
Sun Jun 25 10:18:45 EDT 2017


On 06/25/2017 06:19 AM, Rod Person wrote:
> But doing a simple ls of that directory show it is unicode but the
> replacement of the offending character.
> 
> http://rodperson.com/graphics/uc/ls.png

Now that is really strange.  Your OS seems to not recognize that the
filename is in UTF-8.  I suspect this has something to do with the NAS
file sharing protocol (smb). Though I'm pretty sure that Samba can
handle UTF-8 filenames correctly.

> I am in fact using Python 3.5. I may be lacking in unicode skills but I
> do have the sense enough to know the version of Python I am invoking.
> So I included this screenshot of that so the version of Python and the
> files list returned by os.walk
> 
> http://rodperson.com/graphics/uc/files.png

If I create a file that has the U+2019 character in its name on my Linux
machine (Btrfs), and do os.walk on it, I see the character in the
string properly.  So it looks like Python does the right thing,
automatically decoding from UTF-8.
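
For reference, here is roughly the kind of check I mean, as a minimal
sketch. The file name and the helper walked_names() are made up for
illustration, and it assumes a local filesystem and a UTF-8 locale:

```python
import os
import tempfile

# Hypothetical file name containing U+2019 (RIGHT SINGLE QUOTATION MARK).
NAME = "Don\u2019t Stop.txt"


def walked_names():
    """Create NAME in a scratch directory and return what os.walk reports."""
    with tempfile.TemporaryDirectory() as root:
        # Create an empty file with the non-ASCII name.
        open(os.path.join(root, NAME), "w").close()
        found = []
        for _dirpath, _dirnames, filenames in os.walk(root):
            found.extend(filenames)
        return found


if __name__ == "__main__":
    for name in walked_names():
        # ascii() shows the code points without depending on the terminal.
        print(ascii(name))
```

If the name comes back with a single \u2019 in it, Python decoded the
UTF-8 bytes from the filesystem correctly.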

In your situation I think the problem is the file sharing protocol that
your NAS is using. Somehow some information is being lost and your OS
does not know that the filenames are in UTF-8, and just thinks they are
bytes. And therefore Python doesn't know to decode the string, so you
just end up with each byte being converted to a unicode code point and
being shoved into the unicode string.

How to get around this issue I don't know.  Maybe there's a way to
convert the unicode string to bytes using the value of each character,
and then decode that back to unicode.
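
Something along these lines might work, as an untested-on-your-setup
sketch. It assumes the bad names really do hold one code point per
original byte; repair_name() is a name I made up. Encoding with Latin-1
turns each character below U+0100 back into the byte of the same value,
and the result can then be decoded as UTF-8:

```python
def repair_name(name):
    """Undo a byte-per-code-point mis-decoding of a UTF-8 file name.

    Returns the name unchanged if it does not look like that kind of
    damage (a character above U+00FF, or bytes that are not valid UTF-8).
    """
    try:
        # Latin-1 maps code points 0..255 straight back to bytes 0..255.
        raw = name.encode("latin-1")
        # Reinterpret those bytes as the UTF-8 they were meant to be.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


if __name__ == "__main__":
    # Hypothetical damaged name: U+2019 split into three characters.
    damaged = "Don\u00e2\u0080\u0099t"
    print(ascii(repair_name(damaged)))
```

You could apply that to each entry in the filenames list that os.walk
gives you. Names that are already correct are returned unchanged.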


