os.walk the apostrophe and unicode

Sat Jun 24 16:34:58 EDT 2017

On 2017-06-24 20:47, Rod Person wrote:
> On Sat, 24 Jun 2017 13:28:55 -0600
> Michael Torrie <torriem at gmail.com> wrote:
> 
>> On 06/24/2017 12:57 PM, Rod Person wrote:
>> > Hi,
>> > 
>> > I'm working on a program that will walk a file system and clean the
>> > id3 tags of mp3 and flac files. Everything is working great until
>> > the following file is found
>> > 
>> > '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
>> > 
>> > for some reason that I can't understand os.walk() returns this file
>> > name as
>> > 
>> > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>> 
>> That's basically a UTF-8 string there:
>> 
>> $ python3
>> >>> a = b'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>> >>> print(a.decode('utf-8'))
>> 06 - Todd’s Song (Post-Spiderland Song in Progress).flac
>> >>>  
>> 
>> The NAS is just happily reading the UTF-8 bytes and passing them on
>> the wire.
>> 
>> > which then causes more hell than a little bit for me. I don't
>> > understand why the apostrophe (') becomes \xe2\x80\x99, or what I can
>> > do about it.
>> 
>> It's clearly not an ASCII apostrophe (U+0027) in the original
>> filename, but probably U+2019 (’), the right single quotation mark.
>> 
>> > The script is Python 3, the file system it is running on is a hammer
>> > filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS
>> > which runs some kind of Linux, so it's probably ext3/4. The files
>> > came from various systems (Mac, Windows, FreeBSD).
>> 
>> It's the file serving protocol that dictates how filenames are
>> transmitted. In your case it's probably smb. smb (samba) is just
>> passing the native bytes along from the file system.  Since you know
>> the native file system is just UTF-8, you can just decode every
>> filename from utf-8 bytes into unicode.
> 
> This is the impression that I was under. My unicode isn't that strong,
> so maybe my understanding is off...but I tried.
> 
> 	file_name = file_name.decode('utf-8', 'ignore')
> 
> but when I get to my logging code:
> 
> 	logfile.write(file_name)
> 
> that throws the error:
> 	UnicodeEncodeError: 'ascii' codec can't encode characters in
> 	position 39-41: ordinal not in range(128)
> 
> 
Your logfile was opened with the 'ascii' encoding, so you can't write 
anything outside the ASCII range.

Open it with the 'utf-8' encoding instead.
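
A minimal sketch of that, assuming the log is opened with the built-in
open(). The helper name write_log, the shortened file name and the
temporary log path are made up for illustration; they are not from your
script.

```python
import os
import tempfile

# Hypothetical file name containing U+2019, decoded from UTF-8 bytes.
file_name = b"06 - Todd\xe2\x80\x99s Song.flac".decode("utf-8")


def write_log(path, lines):
    """Append lines to a log file, always encoding them as UTF-8."""
    # Passing encoding= explicitly avoids falling back to the locale's
    # default encoding, which may be ASCII.
    with open(path, "a", encoding="utf-8") as logfile:
        for line in lines:
            logfile.write(line + "\n")


# Illustrative log location in a fresh temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "cleanup.log")
write_log(log_path, [file_name])

# Read the log back with the same encoding to check the round trip.
with open(log_path, encoding="utf-8") as logfile:
    logged = logfile.read()
```

If open() is called without encoding=, Python 3 picks the encoding from
the locale, which is presumably how the 'ascii' codec ended up in your
traceback.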