os.walk the apostrophe and unicode

Sat Jun 24 15:28:55 EDT 2017

On 06/24/2017 12:57 PM, Rod Person wrote:
> Hi,
> 
> I'm working on a program that will walk a file system and clean the id3
> tags of mp3 and flac files, everything is working great until the
> following file is found
> 
> '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
> 
> for some reason that I can't understand os.walk() returns this file
> name as
> 
> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'

Those are basically the raw UTF-8 bytes of the filename:

$ python3
>>> a = b'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>>> print(a.decode('utf-8'))
06 - Todd’s Song (Post-Spiderland Song in Progress).flac
>>>

The NAS is just happily reading the UTF-8 bytes and passing them over
the wire unchanged.

> which then causes more hell than a little bit for me. I don't
> understand why apostrophe(') becomes \xe2\x80\x99, or what I can do
> about it.

It's clearly not an ASCII apostrophe (U+0027) in the original filename,
but U+2019 (’), RIGHT SINGLE QUOTATION MARK, whose UTF-8 encoding is
exactly the three bytes \xe2\x80\x99.
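
If you want to check that for yourself, here is a minimal sketch using
only the standard library. The variable names are just for illustration:

```python
import unicodedata

# The three bytes that showed up in place of the "apostrophe".
raw = b"\xe2\x80\x99"

# Decoding them as UTF-8 yields a single character.
char = raw.decode("utf-8")

print(hex(ord(char)))              # 0x2019
print(unicodedata.name(char))      # RIGHT SINGLE QUOTATION MARK
print(char.encode("utf-8") == raw) # True

# A plain ASCII apostrophe is a different character and a single byte.
print("'".encode("utf-8"))         # b"'"
```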

> The script is Python 3, the file system it is running on is a hammer
> filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS which
> runs some kind of Linux so it's probably ext3/4. The files came from
> various systems (Mac, Windows, FreeBSD).

It's the file serving protocol that dictates how filenames are
transmitted. In your case it's probably SMB (Samba on the NAS), which is
just passing the native bytes along from the file system. Since you know
the filenames are stored as UTF-8, you can just decode every filename
from UTF-8 bytes into a Python str.
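
One way to do that, as a rough sketch rather than a drop-in fix: pass a
bytes path to os.walk() so it hands back raw bytes names, then decode
them yourself. The helper name iter_files and the choice of
errors="replace" are my assumptions, not anything from your script. The
raw bytes path is kept for opening the file, and the decoded name is for
display or for writing into tags.

```python
import os


def iter_files(top, encoding="utf-8"):
    """Yield (raw_path, text_name) for every file below top.

    Hypothetical helper: raw_path is the bytes path exactly as the
    file system reports it, text_name is the file name decoded to str.
    """
    # A bytes argument makes os.walk() return bytes names, so nothing is
    # decoded behind our back using the locale's encoding.
    for dirpath, _dirnames, filenames in os.walk(os.fsencode(top)):
        for raw_name in filenames:
            # Names that are not valid UTF-8 get U+FFFD instead of
            # raising; the raw path still identifies the real file.
            text_name = raw_name.decode(encoding, errors="replace")
            yield os.path.join(dirpath, raw_name), text_name
```

Usage would look something like this, with the mount point being a
made-up example:

    for raw_path, name in iter_files("/mnt/nas/music"):
        print(name)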