os.listdir, gets unicode, returns unicode... USUALLY?!?!?

Mon Nov 20 13:10:57 EST 2006

Martin v. Löwis wrote:
> One approach I had been considering is to always make the decoding
> succeed, by using the private-use-area of Unicode to represent bytes
> that don't decode correctly.

Ross Ridge wrote:
> That would conflict with private use characters appearing in file
> names.

Martin v. Löwis wrote:
> Not necessarily: they could get escaped.

How?

> AFAICT, you can have that conflict only if the file system encoding
> is UTF-8: otherwise, there is no way to represent them.

They can also appear in UTF-16 filenames (obviously) and in various
Far East multi-byte encodings.
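
To make the conflict concrete, here is a minimal sketch, written for
Python 3 and assuming an unescaped mapping of each undecodable byte to
U+E000 plus the byte value.  The handler name "pua_escape" and the base
code point are hypothetical choices for illustration, not anything Python
defines.  Two different byte-string filenames end up as the same Unicode
string:

```python
import codecs

PUA_BASE = 0xE000  # hypothetical start of the range used for bad bytes


def pua_escape(exc):
    """Map each undecodable byte to a private-use code point."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua_escape", pua_escape)

# A name containing a byte that is invalid in UTF-8 ...
raw_invalid = b"caf\xff"
# ... and a valid UTF-8 name that really contains U+E0FF.
raw_genuine = "caf\ue0ff".encode("utf-8")

decoded_invalid = raw_invalid.decode("utf-8", "pua_escape")
decoded_genuine = raw_genuine.decode("utf-8", "pua_escape")

# Different bytes on disk, identical strings after decoding.
collision = (raw_invalid != raw_genuine
             and decoded_invalid == decoded_genuine)
```

Any escaping scheme would have to tell these two names apart again when
the string is passed back to open() or stat().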

> > Personally, I think os.listdir() should return the file names only in
> > Unicode if they're actually stored that way in the underlying file
> > system (eg. NTFS), otherwise return them as byte strings.  I doubt
> > anyone in this thread would like that, though.
>
> So I assume you would not want to allow to pass Unicode strings
> to open(), stat() etc. either, as the _real_ file system API requires
> byte strings there, as well?

No, I just expect that if the underlying file system API accepts a
given byte or Unicode string, I can pass the same string to open(),
stat(), etc. and have it work.  I have no problem if additional strings
happen to work because Python converts byte strings to Unicode or
vice versa as the API requires.

Should I assume that since you think that having "os.listdir()" return
Unicode strings when passed a Unicode directory name is a good idea,
that you also think that file object methods (eg. readline) should
return Unicode strings when opened with a Unicode filename?

> Technically, how do you determine whether the underlying file
> system stores file names "in Unicode"?

On Windows you can use GetVolumeInformation(), though it may be more
practical to assume Unicode or byte strings based on the OS.  On Unix
you'd assume byte strings.
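
A rough sketch of the "assume based on the OS" approach, written for
Python 3.  The helper names natural_filename_type() and listdir_natural()
are made up for illustration, and the sketch does not query the volume at
all; it only looks at os.name:

```python
import os


def natural_filename_type():
    """Guess the native filename type from the OS alone (an assumption)."""
    return str if os.name == "nt" else bytes


def listdir_natural(path):
    """List a directory, returning names in the guessed native type."""
    if natural_filename_type() is bytes:
        # A bytes argument makes os.listdir() return bytes names.
        return os.listdir(os.fsencode(path))
    return os.listdir(os.fsdecode(path))


entries = listdir_natural(".")
```

On Unix this returns the names exactly as the file system stores them, so
nothing has to be decoded and nothing can fail to decode.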

> Does OSX use Unicode (it requires path names to be UTF-8)?

HFS+ uses Unicode.  I have no idea how you'd figure out the properties
of a filesystem under OS X, but then the Python docs suggest this
os.listdir() Unicode feature doesn't work on Macintosh systems anyway.

> After all, each and every encoding is a Unicode encoding - that was a design
> goal of Unicode.

If it were as simple as that, then yes, there wouldn't be a problem.
Unfortunately, as this thread has revealed, os.listdir() isn't always
able to map byte string filenames into Unicode, either because they
don't use the assumed encoding, don't all use the same encoding, or
don't use any standard encoding.  That's the problem here: there's no
encoding associated with Unix filenames, they're just byte strings.
Since Python byte strings also have no encoding associated with them,
they're the natural way of representing all valid file names on Unix
systems.  On the other hand, under Windows NT/2K/XP and NTFS or VFAT the
natural way to represent all valid file names is Unicode strings.
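
As a small illustration (Python 3, with a made-up filename), the same
bytes are a perfectly valid Unix filename, yet what they "mean" depends
entirely on which encoding you guess:

```python
# A byte-string filename as a Unix file system might store it.
name = b"r\xe9sum\xe9.txt"


def try_decode(raw, encoding):
    """Return the decoded string, or None if the bytes don't fit."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


as_latin1 = try_decode(name, "latin-1")  # "résumé.txt"
as_koi8r = try_decode(name, "koi8_r")    # decodes, but to Cyrillic letters
as_utf8 = try_decode(name, "utf-8")      # None: not valid UTF-8
```

Nothing in the file system says which of these guesses, if any, is the
right one.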

                                            Ross Ridge