os.walk the apostrophe and unicode

Sun Jun 25 10:28:17 EDT 2017

Rod Person wrote:

> Ok...so after reading all the replies in the thread, I thought it would
> be easier to send a general reply and include some links to screenshots.
> 
> As Peter mentioned, the logical thing to do would be to fix the file name
> to what I actually thought it was, and if this was for work that's
> probably what I would have done, but since I want to understand what's
> going on I decided to waste time on that.
> 
> I have to admit, I didn't think the file system was utf-8 as seeing what
> looked to be an apostrophe sent me down the road of why is this
> apostrophe screwed up instead of "ah this must be unicode".
> 
> But doing a simple ls of that directory shows it is unicode, but with a
> replacement for the offending character.
> 
> http://rodperson.com/graphics/uc/ls.png

Have you set LANG to something that implies ASCII?

$ touch Todd’s ähnlich üblich löblich
$ ls
ähnlich  löblich  Todd’s  üblich
$ LANG=C ls
Todd???s  l??blich  ??hnlich  ??blich
$ python3 -c 'import os; print(os.listdir())'
['Todd’s', 'üblich', 'ähnlich', 'löblich']
$ LANG=C python3 -c 'import os; print(os.listdir())'
['Todd\udce2\udc80\udc99s', '\udcc3\udcbcblich', '\udcc3\udca4hnlich', 'l\udcc3\udcb6blich']
$ LANG=en_US.utf-8 python3 -c 'import os; print(os.listdir())'
['Todd’s', 'üblich', 'ähnlich', 'löblich']

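For file names, Python 3 resorts to lone surrogates whenever a byte does not 
translate into a character in the advertised encoding: each undecodable byte 
0xNN becomes the code point U+DCNN (the "surrogateescape" error handler 
described in PEP 383), which is why 0xE2 shows up as \udce2 above.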
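
The same mapping can be reproduced without touching the file system. This is 
only a sketch: it decodes with an explicit "ascii" codec instead of whatever 
encoding LANG advertises, and the variable names are made up for the example.

```python
# UTF-8 bytes of the file name, as stored on disk.
raw = "Todd’s".encode("utf-8")

# Decoding as ASCII with surrogateescape turns each undecodable byte
# 0xNN into the lone surrogate U+DCNN instead of raising an error.
as_ascii = raw.decode("ascii", errors="surrogateescape")
print(ascii(as_ascii))

# The mapping is reversible, so the original bytes are not lost ...
restored = as_ascii.encode("ascii", errors="surrogateescape")

# ... and can be decoded properly once the real encoding is known.
fixed = restored.decode("utf-8")
print(fixed)
```

Because the round trip is lossless, such a name can still be passed back to 
open() or os.stat() even though it cannot be printed cleanly.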

> I am in fact using Python 3.5. I may be lacking in unicode skills but I
> do have sense enough to know the version of Python I am invoking.

I've made so many "stupid errors" myself that I always consider them first ;)

> So I included this screenshot showing the version of Python and the
> file list returned by os.walk
> 
> http://rodperson.com/graphics/uc/files.png
> 
> So the fact that it shows as a string and not bytes in the debugger was
> throwing me for a loop. In my log section I was trying to determine
> whether it was unicode and, if so, decode it; if not, do nothing. That
> wasn't working.
> 
> http://rodperson.com/graphics/uc/log_section.png
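
In Python 3, os.walk() yields str names when it is given a str path and bytes 
names only when it is given a bytes path, so names that are already str have 
nothing left to decode. I can't see the code behind the screenshot, so the 
following is just a sketch of the check described; to_text() is a hypothetical 
helper, not something from the original script.

```python
import os


def to_text(name):
    """Return a file name as str, decoding it only if it is bytes."""
    if isinstance(name, bytes):
        # os.fsdecode() uses the file system encoding together with
        # the surrogateescape error handler.
        return os.fsdecode(name)
    # Already a str: calling .decode() here would raise AttributeError.
    return name


print(to_text(b"plain.flac"))
print(to_text("Todd’s Song.flac"))
```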
> 
> 
> 
> 
> On Sun, 25 Jun 2017 10:47:18 +0200
> Peter Otten <__peter__ at web.de> wrote:
> 
>> Steve D'Aprano wrote:
>> 
>> > On Sun, 25 Jun 2017 04:57 pm, Peter Otten wrote:
>> 
>> >> if everything worked correctly? Though I don't understand why the
>> >> OP doesn't see
>> >> 
>> >> '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>> >> 
>> >> which is the repr() that I get.
>> > 
>> > That's mojibake and is always wrong :-)
>> 
>> Yes, that's my very point.
>> 
>> > I'm not sure how you got that.
>> 
>> I took the OP's string at face value and pasted it into the
>> interpreter:
>> 
>> # python 3.4
>> >>> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>> '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>> 
>> > Something to do with an accidental decode to Latin-1?
>> 
>> If the above filename is the only one or one of a few that seem
>> broken, and other non-ascii filenames look OK the OP's
>> toolchain/filesystem may work correctly and the odd name might have
>> been produced elsewhere, e. g. by copying an already messed-up
>> freedb.org entry.
>> 
>> [Heureka]
>> 
>> However, the most likely explanation is that the filename is correct
>> and that the OP is not using Python 3 as he claims but Python 2.
>> 
>> Yes, it took that long for me to realise ;) Python 2 is slowly
>> sinking into oblivion...
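
For the record, the mojibake repr quoted above is what Python 3 shows when the 
three UTF-8 bytes of the apostrophe are taken as three separate Latin-1 code 
points. A short sketch that reproduces and reverses it, with the file name 
shortened and the variable names invented for illustration:

```python
# In Python 3 each \xNN escape in a str literal is a code point,
# so this is three characters (U+00E2, U+0080, U+0099), not three bytes.
mojibake = '06 - Todd\xe2\x80\x99s Song'
print(repr(mojibake))

# Latin-1 maps code points 0-255 straight back to the same byte values,
# which can then be decoded as the UTF-8 they were meant to be.
repaired = mojibake.encode('latin-1').decode('utf-8')
print(repaired)
```

In Python 2 the same literal is a byte string, so its repr shows the raw 
escapes and nothing has actually been mis-decoded.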