os.walk the apostrophe and unicode

MRAB python at mrabarnett.plus.com
Sat Jun 24 15:25:53 EDT 2017


On 2017-06-24 19:57, Rod Person wrote:
> Hi,
> 
> I'm working on a program that will walk a file system and clean the ID3
> tags of MP3 and FLAC files. Everything is working great until the
> following file is found:
> 
> '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
> 
> For some reason that I can't understand, os.walk() returns this file
> name as
> 
> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
> 
> which then causes more hell than a little bit for me. I don't
> understand why the apostrophe (') becomes \xe2\x80\x99, or what I can do
> about it.
> 
> The script is Python 3, and the file system it is running on is a HAMMER
> filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS which
> runs some kind of Linux, so it is probably ext3/4. The files came from
> various systems (Mac, Windows, FreeBSD).
> 
If you treat it as a bytestring b'\xe2\x80\x99' and decode it:

 >>> c = b'\xe2\x80\x99'.decode('utf-8')
 >>> ascii(c)
"'\\u2019'"
 >>> import unicodedata
 >>> unicodedata.name(c)
'RIGHT SINGLE QUOTATION MARK'

It's not an apostrophe, it's '\u2019' ('\N{RIGHT SINGLE QUOTATION MARK}').
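
To make the difference concrete, here is a small sketch that puts the two characters side by side. The variable names are only for illustration; the calls are all from the standard library.

```python
import unicodedata

# The plain ASCII apostrophe that you expected to see in the filename.
apostrophe = "'"

# The character that is actually in the filename, decoded from its UTF-8 bytes.
quote = b'\xe2\x80\x99'.decode('utf-8')

# Show the code point, the Unicode name and the UTF-8 encoding of each one.
for ch in (apostrophe, quote):
    print('U+%04X' % ord(ch), unicodedata.name(ch), ch.encode('utf-8'))

# They look similar on screen, but they are different characters.
print(apostrophe == quote)
```

The apostrophe is a single byte in UTF-8, whereas the quotation mark takes three bytes, which are the three \x escapes you're seeing.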

It looks like the filename is encoded as UTF-8, but Python thinks that 
the filesystem encoding is something like Latin-1.
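
If that guess is right, one way to check it and to repair a name is sketched below. This is only a sketch under that assumption: redecode() is a hypothetical helper, not something from the standard library, and the shortened filename and the Latin-1 default are just for illustration.

```python
import sys

def redecode(name, wrong='latin-1', right='utf-8'):
    """Undo a wrong decode (hypothetical helper).

    Encode the str back to the bytes it came from, then decode those
    bytes with the encoding they were really written in. If either
    step fails, the name was probably not mangled this way, so it is
    returned unchanged.
    """
    try:
        return name.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name

# What Python believes the filesystem encoding is on this machine.
print(sys.getfilesystemencoding())

# A shortened version of the name as os.walk() reported it.
mangled = '06 - Todd\xe2\x80\x99s Song.flac'
fixed = redecode(mangled)
print(ascii(fixed))
```

Another option is to pass a bytes path to os.walk(), for example os.walk(b'.'), in which case the names come back as bytes and you can decode them yourself. os.fsencode(name) will also give you back the original bytes of a name that os.walk() returned as a str.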


