[Python-Dev] PEP 277 (unicode filenames): please review

Guido van Rossum guido@python.org
Mon, 12 Aug 2002 16:07:46 -0400


> > http://www.python.org/peps/pep-0277.html
> >
> > The PEP describes a Windows-only change to Unicode in file names: On
> > Windows NT/2k/XP, Python would allow arbitrary Unicode strings as file
> > names and pass them to the OS, instead of converting them to CP_ACP
> > first. This applies to open() and all os functions that accept
> > filenames.
> >
> > In addition, os.listdir() would return Unicode filenames if the argument
> > is Unicode.
> 
> This is the bit I still don't like (at least, if I'm not 
> mistaken I commented on it a while ago too). A routine could be 
> doing an os.listdir() expecting strings, but suddenly someone 
> passes it a unicode directoryname and the return value would 
> change.

Hm, that would be the responsibility of whoever passes it Unicode.
Most code works just fine when presented with Unicode where 8-bit
strings are expected.  It's only code that assumes the 8-bit strings
are Latin-1 (or something else besides ASCII) that gets in trouble.
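
To make the concern concrete, here is a minimal sketch of a return
type that follows the argument type. It assumes a current Python 3,
where bytes and str stand in for the 8-bit and Unicode strings
discussed here; it is not the PEP's implementation, and the helper
name listing_types is hypothetical.

```python
import os
import tempfile


def listing_types(directory):
    """List one directory twice: once via a str path, once via a bytes path."""
    as_text = os.listdir(directory)                 # str in, str names out
    as_bytes = os.listdir(os.fsencode(directory))   # bytes in, bytes names out
    return as_text, as_bytes


# Throwaway directory with a single ASCII-named file (illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    text_names, byte_names = listing_types(tmp)
```

A caller that only ever handles one of the two types is unaffected
unless someone hands it the other kind of directory name.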

But shouldn't it return Unicode whenever there are filenames in the
directory that can't be represented as ASCII?

That's what Tkinter does: Tk gives back UTF-8, which degenerates to
ASCII if there are only ASCII chars; if any high bits are detected,
Tkinter decodes the UTF-8, turning the return string into Unicode.
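
A rough sketch of that rule, not Tkinter's actual code: the function
name tk_style_result is hypothetical, and Python 3 bytes/str are
assumed as stand-ins for the 8-bit and Unicode string types.

```python
def tk_style_result(raw: bytes):
    """Return pure-ASCII data unchanged; decode anything else as UTF-8."""
    if all(b < 0x80 for b in raw):
        # No high bits set: the UTF-8 data is plain ASCII, keep the 8-bit form.
        return raw
    # At least one high bit: treat the data as UTF-8 and return Unicode.
    return raw.decode("utf-8")


ascii_name = tk_style_result(b"readme.txt")
accented_name = tk_style_result("caf\u00e9.txt".encode("utf-8"))
```

Under these assumptions ascii_name stays an 8-bit string while
accented_name comes back as Unicode, so the result type depends on
the data rather than on what the caller passed in.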

> I would much prefer an optional encoding argument whereby you 
> give the encoding in which you want the return value. Default 
> would be the local filesystem encoding. If you pass unicode you 
> will get direct unicode on XP/2K, and a converted string on 
> other platforms (but always unicode).

Hm, I don't know if I'd like os.listdir() to have an encoding
argument.  Sounds like the wrong solution somehow.

> Oh yes, the same reasoning would hold for readlink(), getcwd() 
> and any other call that returns filenames.

Ditto.

--Guido van Rossum (home page: http://www.python.org/~guido/)