os.listdir, gets unicode, returns unicode... USUALLY?!?!?

Ross Ridge rridge at csclub.uwaterloo.ca
Tue Nov 21 01:28:41 EST 2006


Martin v. Löwis wrote:
> Then I would use U+E000 for escaping. Each PUA character in the
> listed file name would get escaped with U+E000 in the Python
> string; when the file name is converted back to the system, it
> gets unescaped.

How would you tell an escaped file name containing these private use
characters obtained from os.listdir() from an unescaped file name
containing these characters obtained from some other source?
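
To make the ambiguity concrete, here is a minimal sketch of one possible
reading of the proposed scheme.  The names ESC, is_pua(), escape() and
unescape() are hypothetical, chosen only for illustration, and the sketch
assumes "escaping" means prefixing each Basic Multilingual Plane PUA
character (U+E000..U+F8FF) with U+E000.  It is not an actual Python
implementation of anything in the standard library.

```python
ESC = "\ue000"  # assumed escape character, per the proposal quoted above


def is_pua(ch):
    """True for characters in the BMP Private Use Area (U+E000..U+F8FF)."""
    return 0xE000 <= ord(ch) <= 0xF8FF


def escape(name):
    """Prefix every PUA character in a raw file name with ESC."""
    return "".join(ESC + ch if is_pua(ch) else ch for ch in name)


def unescape(name):
    """Undo escape(): drop each ESC and keep the character that follows it."""
    out = []
    chars = iter(name)
    for ch in chars:
        if ch == ESC:
            # A trailing lone ESC is kept as-is (an arbitrary choice here).
            ch = next(chars, ESC)
        out.append(ch)
    return "".join(out)


# A name on disk with ONE PUA character, as an escaping listdir would return it.
from_listdir = escape("a\ue001")

# A raw, unescaped name from some other source that happens to contain
# the TWO PUA characters U+E000 U+E001.
from_elsewhere = "a\ue000\ue001"

# The two Python strings are identical, yet they name different files.
print(from_listdir == from_elsewhere)
```

Given only the string, nothing says whether unescape() still has to be
applied, so the caller would have to track where each name came from.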

> Notice that I think this is a really unrealistic case - I expect
> that all file names containing PUA characters were deliberately
> crafted to investigate using PUA characters in file names.

I suspect a more common case is file names containing end-user defined
characters.

> What Far-East multi-byte encoding uses PUA characters,
> and for what characters?

Pretty much all of them, as near as I can tell.  See the following WWW
page for a discussion of this issue:

   http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html

> On no operating system I'm aware of can you pass "Unicode strings" to
> open() or stat().

*sigh*  I was referring to the Python functions "open() and stat() etc."
just as you had in the paragraph I copied those exact words from.

> > On Windows you can use GetVolumeInformation()...
>
> On Windows, the entire issue doesn't exist:

On Windows, I think you should use GetVolumeInformation() to decide
whether or not os.listdir() returns Unicode or byte strings, rather
than the type of the argument.

> > ... but then the Python docs suggest this
> > os.listdir() Unicode feature doesn't work on Macintosh systems anyways.
>
> Either the docs are wrong, or you are misinterpreting them. It works
> just fine in practice.

As the original poster in this thread wrote, the docs say:

   On Windows NT/2k/XP and Unix, if path is a Unicode
   object, the result will be a list of Unicode objects

The implication being that Macintosh systems don't support this
feature.
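
The documented behaviour, where the type of the result follows the type
of the argument, can be observed with a short sketch.  It assumes a
current Python 3, where str plays the role of the "Unicode object" and
bytes the role of the byte string; listdir_both_ways() is a hypothetical
helper name, and the file name used is plain ASCII so that nothing here
depends on the platform's file name encoding.

```python
import os
import tempfile


def listdir_both_ways(directory):
    """Return (names for a str path, names for a bytes path)."""
    return os.listdir(directory), os.listdir(os.fsencode(directory))


with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so there is something to list.
    open(os.path.join(tmp, "plain.txt"), "w").close()
    text_names, byte_names = listdir_both_ways(tmp)

print(text_names)   # a list of str, because the path was a str
print(byte_names)   # a list of bytes, because the path was bytes
```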

> > That's the problem here, there's no
> > encoding associated with Unix filenames, they're just byte strings.
>
> Can you please quote chapter and verse of the POSIX spec that says
> so?

I said Unix, not POSIX.  In practice, Unix systems don't associate an
encoding with filenames, and any byte value other than '/' or '\0' is
permitted in a filename.  Not that it really matters: Python byte
strings are also the natural way to represent file names stored in an
unspecified encoding.
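
As a minimal sketch of that point, assuming Python 3 bytes objects stand
in for byte strings: the hypothetical helpers is_valid_unix_name() and
try_decode() below show a name that is acceptable as a sequence of bytes
but only decodes if you happen to guess the right encoding.  Nothing is
written to disk.

```python
def is_valid_unix_name(name: bytes) -> bool:
    """A single path component: non-empty, with no '/' and no NUL byte."""
    return len(name) > 0 and b"/" not in name and b"\0" not in name


def try_decode(name: bytes, encoding: str):
    """Return the decoded name, or None if the bytes don't fit the encoding."""
    try:
        return name.decode(encoding)
    except UnicodeDecodeError:
        return None


# "caf" followed by e-acute, stored by a program that assumed Latin-1.
latin1_name = "caf\xe9".encode("latin-1")

print(is_valid_unix_name(latin1_name))      # fine as far as the bytes go
print(try_decode(latin1_name, "utf-8"))     # not valid UTF-8
print(try_decode(latin1_name, "latin-1"))   # decodes only with the right guess
```

Kept as bytes, the name can be passed around and handed back to the
system unchanged, whatever encoding its creator had in mind.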

                                            Ross Ridge



