python 3.1 unicode question

Wed Sep 16 13:21:54 EDT 2009

jeffunit <jeff at jeffunit.com> wrote:

>>That looks like a "surrogate escape" (See PEP 383) 
>>http://www.python.org/dev/peps/pep-0383/.  It indicates the wrong 
>>encoding was used to decode the filename.
> 
> That seems likely. How do I set the encoding to something correct to 
> decode the filename?
> 
> Clearly Windows knows how to display it.
> I suspect since I compiled Python with Cygwin, that it is using a
> POSIX standard, rather than a Windows-specific standard. Of course
> ideally, I would like my code to work on Linux as well as Windows, as I
> back up all of my data to a Linux machine with Samba.
> 
If you are running on a Linux system then the filenames are stored as 
bytes, but the system does not record which encoding was used. In fact, 
different files in the same directory could use different encodings. That's 
why Python 3.1 uses surrogate escapes: you can at least work with the files 
even if you can't display the filenames.
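
As a rough illustration of what the surrogate escape does, here is a 
minimal sketch. The filename bytes are made up (Latin-1 encoded, so not 
valid UTF-8), and I am assuming UTF-8 is the encoding Python tried first:

```python
# Hypothetical filename bytes: "cafe" with an e-acute, encoded as Latin-1.
# The byte 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9.txt"

# With the surrogateescape handler, the undecodable byte 0xE9 becomes the
# lone surrogate U+DCE9 instead of raising UnicodeDecodeError.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly,
# which is why the file can still be opened, renamed, or copied.
restored = name.encode("utf-8", errors="surrogateescape")

# If you happen to know the real encoding, decode the recovered bytes
# with it to get a displayable name (Latin-1 is an assumption here).
fixed = restored.decode("latin-1")
print(ascii(fixed))
```

The round trip back to bytes is the point: the str is not a correct 
rendering of the name, but it loses no information.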

If you are running on Windows and using the native Python to access an 
NTFS-formatted partition then there shouldn't be a problem: the filenames 
are stored as Unicode and Python uses the Unicode APIs. Of course you may 
still not be able to display the filenames if they contain characters not 
available in your output codepage.
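
If display is the only problem, one way around it is to escape whatever 
the console cannot show. This is only a sketch: the filename and the cp437 
codepage are both assumptions picked for the example, so substitute 
whatever your console actually uses:

```python
# Hypothetical filename containing characters that cp437 cannot represent.
name = "\u65e5\u672c\u8a9e.txt"

# Assumed output codepage, chosen only for illustration.
codepage = "cp437"

try:
    # Succeeds only if every character exists in the codepage.
    printable = name.encode(codepage).decode(codepage)
except UnicodeEncodeError:
    # Fall back to \uXXXX escapes for the characters that do not fit.
    printable = name.encode(codepage, errors="backslashreplace").decode(codepage)

print(printable)
```

The name itself is untouched; only the text sent to the console is 
escaped.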

If you use Cygwin, a quick search on Google turned up some old discussions 
implying that it uses the 8-bit APIs, which convert characters using the 
current codepage and replace any character they cannot handle with '?'. I 
have no idea if that still applies.
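
To show why that kind of '?' substitution hurts, here is an analogy using 
Python's own "replace" error handler. It is not Cygwin's actual code path, 
and the filename and the cp1252 codepage are assumptions:

```python
# Hypothetical filename with one character that cp1252 cannot represent.
name = "\u65e5.txt"

# Encoding with errors="replace" substitutes '?' for that character,
# similar in effect to the lossy conversion described above.
lossy = name.encode("cp1252", errors="replace")
print(lossy)

# Decoding gives back a literal question mark; the original character is
# gone, so the result no longer matches the name stored on disk.
print(lossy.decode("cp1252") == name)
```

Unlike a surrogate escape, this conversion cannot be reversed, so a name 
mangled this way may not open the file it came from.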