os.walk the apostrophe and unicode

Rod Person rodperson at rodperson.com
Sat Jun 24 15:47:25 EDT 2017


On Sat, 24 Jun 2017 13:28:55 -0600
Michael Torrie <torriem at gmail.com> wrote:

> On 06/24/2017 12:57 PM, Rod Person wrote:
> > Hi,
> > 
> > I'm working on a program that will walk a file system and clean the
> > id3 tags of mp3 and flac files, everything is working great until
> > the follow file is found
> > 
> > '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
> > 
> > for some reason that I can't understand os.walk() returns this file
> > name as
> > 
> > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in
> > Progress).flac'  
> 
> That's basically a UTF-8 string there:
> 
> $ python3
> >>> a = b'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
> >>> print(a.decode('utf-8'))
> 06 - Todd’s Song (Post-Spiderland Song in Progress).flac
> >>>  
> 
> The NAS is just happily reading the UTF-8 bytes and passing them on
> the wire.
> 
> > which then causes more hell than a little bit for me. I'm not
> > understand why apostrophe(') becomes \xe2\x80\x99, or what I can do
> > about it.  
> 
> It's clearly not an apostrophe in the original filename, but probably
> U+2019 (’)
> 
> > The script is Python 3, the file system it is running on is a hammer
> > filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS
> > which runs some kind of Linux so it probably ext3/4. The files came
> > from various system (Mac, Windows, FreeBSD).  
> 
> It's the file serving protocol that dictates how filenames are
> transmitted. In your case it's probably smb. smb (samba) is just
> passing the native bytes along from the file system.  Since you know
> the native file system is just UTF-8, you can just decode every
> filename from utf-8 bytes into unicode.

That is the impression I was under. My Unicode isn't that strong, so
maybe my understanding is off, but I tried:

	file_name = file_name.decode('utf-8', 'ignore')

but when I get to my logging code:

	logfile.write(file_name)

that throws the error:
	UnicodeEncodeError: 'ascii' codec can't encode characters in
	position 39-41: ordinal not in range(128)
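
For reference, here is a minimal sketch of what I think is going on. It
assumes the log file was opened in text mode and picked up an ASCII
default encoding from the locale, which may not be the exact cause here.
The in-memory buffers and the names ascii_log and utf8_log are only
stand-ins for my real log file:

```python
import io

# The bytes os.walk() reported, decoded as UTF-8.
# U+2019 is a right single quotation mark, not an ASCII apostrophe.
raw_name = b"06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac"
file_name = raw_name.decode("utf-8")

# Assumption: the log file behaves like a text stream with an ASCII encoding.
ascii_log = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    ascii_log.write(file_name)
    ascii_log.flush()
    ascii_failed = False
except UnicodeEncodeError:
    # U+2019 has no ASCII representation, so the write is rejected.
    ascii_failed = True

# Naming the encoding explicitly avoids relying on the locale default.
utf8_buffer = io.BytesIO()
utf8_log = io.TextIOWrapper(utf8_buffer, encoding="utf-8")
utf8_log.write(file_name + "\n")
utf8_log.flush()
```

With a real file the equivalent would presumably be
open(path, 'a', encoding='utf-8'), where path is whatever the log file
is called.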


-- 
Rod

http://www.rodperson.com

Who at Clitorius fountain thirst remove 
Loath Wine and, abstinent, meer Water love.

 - Ovid


