os.walk the apostrophe and unicode

Sun Jun 25 03:53:31 EDT 2017

On Sun, 25 Jun 2017 04:57 pm, Peter Otten wrote:

> Steve D'Aprano wrote:
> 
>> On Sun, 25 Jun 2017 07:17 am, Peter Otten wrote:
>> 
>>> Then I'd fix the name manually...
>> 
>> The file name isn't broken.
>> 
>> 
>> What's broken is parts of the OP's code which assumes that non-ASCII file
>> names are broken...
> 
> Hm, the OP says
> 
> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
> 
> Shouldn't it be
> 
> '06 - Todd’s Song (Post-Spiderland Song in Progress).flac'

It should, if the OP did everything right.

He has a file name containing the word "Todd’s":

# Python 3.5

py> fname = 'Todd’s'
py> repr(fname)
"'Todd’s'"

On disk, that is represented in UTF-8:

py> repr(fname.encode('utf-8'))
"b'Todd\\xe2\\x80\\x99s'"

The OP appears to be using Python 2, so when he calls os.listdir() with a
byte-string path he gets the file names back as bytes, not Unicode. That means:

- each file name is a Python 2 str, which is a *byte string*, not a text string;
- so it is not Unicode;
- rather, it holds the individual bytes of the UTF-8 encoding of the file name.

So in Python 2.7, instead of the 3.5 session above:

py> fname = u'Todd’s'
py> repr(fname)
"u'Todd\\u2019s'"
py> repr(fname.encode('utf-8'))
"'Todd\\xe2\\x80\\x99s'"

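If the OP wants text rather than bytes, one option is to decode each name
explicitly. Here is a minimal sketch, assuming the names really are UTF-8;
to_text is a hypothetical helper of mine, not something from the OP's code:

```python
# Hypothetical helper: normalise a file name to text, whether it
# arrives as a byte string or is already a text string.
def to_text(name, encoding='utf-8'):
    # The 'utf-8' default is an assumption about how the names are stored.
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name

raw = b'Todd\xe2\x80\x99s'      # the bytes a bytes-mode listing would give
text = to_text(raw)
print(text == u'Todd\u2019s')   # True
```

A name that is not valid UTF-8 will raise UnicodeDecodeError here, which is
arguably what you want: better to find out than to silently mangle it.
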
> if everything worked correctly? Though I don't understand why the OP doesn't
> see
> 
> '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac'
> 
> which is the repr() that I get.

That's mojibake and is always wrong :-) I'm not sure how you got that. Perhaps
the UTF-8 bytes were accidentally decoded as Latin-1?

# Python 2.7
py> repr(fname.encode('utf-8').decode('latin-1'))
"u'Todd\\xe2\\x80\\x99s'"

# Python 3.5
py> repr(fname.encode('utf-8').decode('latin-1'))
"'Toddâ\\x80\\x99s'"

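If that is what happened, the damage is reversible, because Latin-1 maps every
byte to exactly one code point. A rough sketch, assuming the mis-decode really
was Latin-1 and nothing else touched the string (repair_mojibake is a
hypothetical name of mine):

```python
def repair_mojibake(text):
    """Undo an accidental UTF-8 -> Latin-1 mis-decode (hypothetical helper)."""
    # Encoding as Latin-1 recovers the original bytes; decode them properly.
    return text.encode('latin-1').decode('utf-8')

good = u'Todd\u2019s'
bad = good.encode('utf-8').decode('latin-1')   # reproduce the mojibake
print(repair_mojibake(bad) == good)            # True
```

If the wrong codec was something else, Windows-1252 say, this won't round-trip
and you would need to encode with that codec instead.
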
-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.