[Patches] [ python-Patches-683592 ] unicode support for os.listdir()

Tue, 25 Feb 2003 09:22:44 -0800

Patches item #683592, was opened at 2003-02-09 16:43
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=683592&group_id=5470

Category: Library (Lib)
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Just van Rossum (jvr)
Assigned to: Nobody/Anonymous (nobody)
Summary: unicode support for os.listdir()

Initial Comment:
The attached patch makes os.listdir() return unicode strings, on platforms that have Py_FileSystemDefaultEncoding defined as non-NULL.

I'm by no means sure this is the right thing to do; it does seem right on OSX where Py_FileSystemDefaultEncoding is (or rather: will be real soon, I'm waiting for Jack's approval) utf-8. I'd be happy to add the code in an OSX-specific switch.

A more subtle variant could perhaps only return unicode strings if the file name is not ASCII.

----------------------------------------------------------------------

>Comment By: Guido van Rossum (gvanrossum)
Date: 2003-02-25 12:22

Message:
Logged In: YES 
user_id=6380

OK, check it in, just be prepared for contingencies. I
really cannot judge whether this is right on all platforms.

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2003-02-25 10:55

Message:
Logged In: YES 
user_id=92689

Having missed 2.3a2, I'd like to get this in way ahead of 2.3b1. Any objections?

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2003-02-10 13:17

Message:
Logged In: YES 
user_id=92689

I'm pretty sure os.path deals just fine with unicode strings (it's all pure string manipulations, isn't it?)
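As a rough illustration of the "pure string manipulation" point, here is a minimal sketch. It assumes Python 3, where every str is already a Unicode string, so it does not reproduce the 2.3 situation; it only shows that the path helpers treat non-ASCII characters like any others. The file name is made up, and posixpath is used so the separator is the same on every platform.

```python
import posixpath

# Hypothetical non-ASCII file name ("résumé.txt").
name = "r\u00e9sum\u00e9.txt"

joined = posixpath.join("docs", name)       # plain concatenation with "/"
root, ext = posixpath.splitext(joined)      # splits at the last "."
base = posixpath.basename(joined)           # everything after the last "/"
```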

Worries: well, apparently on Windows os.listdir() has been returning unicode for some time, so it's not like we're breaking completely new ground here.

If anything breaks it's probably good this happens, as it gives an opportunity to fix things... I just found several examples of potential breakage: _bsddb.c parses a filename arg with the "z" format specifier. gdbmmodule.c uses "s". bsddbmodule.c and dbmmodule.c as well.

I'm not sure the above modules work on Windows with non-ascii filenames at all, but it doesn't look like it. Besides Windows (for which my patch is not relevant), only OSX sets Py_FileSystemDefaultEncoding, so any new breakage won't reach a mass market right away <wink>.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-02-10 12:46

Message:
Logged In: YES 
user_id=38388

Ok, let's look at it from a different angle: things that you get from
os.listdir() should be compatible with (at least) all the os.path tools
and os itself. Converting to Unicode has the advantage that slicing and
indexing into the path names will not break the paths (unlike UTF-8
encoded 8-bit strings, which tend to break when you slice them).
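The slicing problem can be sketched as follows. This assumes Python 3, with bytes standing in for the 8-bit string and str for the unicode string; the file name and the helper is_valid_utf8 are made up for illustration.

```python
name = "caf\u00e9.txt"            # "café.txt"; the é is a single code point
encoded = name.encode("utf-8")    # é becomes the two bytes 0xC3 0xA9

# Slicing the text keeps whole characters.
text_prefix = name[:4]

# Slicing the bytes at the same index cuts the é in half.
byte_prefix = encoded[:4]


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

The text slice is still a usable name prefix, while the byte slice ends in a lone lead byte and no longer decodes.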

That said, I think you're right about the ASCII approach provided that
the os, os.path tools can actually properly cope with Unicode.

What I worry about is that if os.listdir() gives back Unicode for e.g.
Latin-1 filenames and the application then passes the Unicode names to
a C API using "s", perfectly working code will break... then again the
C code should really use "es" for encoding to the
Py_FileSystemDefaultEncoding as is done in e.g. fileobject.c.

I really don't know what to do here...

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2003-02-10 11:24

Message:
Logged In: YES 
user_id=92689

Here's an argument for ASCII and against the default encoding: if the default encoding is different from Py_FileSystemDefaultEncoding, things go wrong: an 8-bit string passed to file() will be interpreted as Py_FileSystemDefaultEncoding (more precisely: will not be interpreted at all), not the default encoding...
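A minimal sketch of that failure mode, assuming Python 3 (bytes as the 8-bit string). The two encoding values are hypothetical stand-ins for the settings discussed here, not values taken from any real platform.

```python
# Hypothetical stand-ins for the two settings in the discussion.
FS_ENCODING = "utf-8"          # plays the role of Py_FileSystemDefaultEncoding
DEFAULT_ENCODING = "latin-1"   # a default encoding that differs from it

# The bytes the file system actually stores for "é.txt".
on_disk = "\u00e9.txt".encode(FS_ENCODING)

# listdir-style conversion: decode with the file system encoding, then
# narrow back to an 8-bit string with the *default* encoding.
as_unicode = on_disk.decode(FS_ENCODING)
narrowed = as_unicode.encode(DEFAULT_ENCODING)

# An 8-bit string is handed to the OS uninterpreted, so these bytes
# would have to equal on_disk to name the same file. They do not.
same_file = narrowed == on_disk
```

Narrowing to ASCII avoids this only in the common case where the file system encoding is an ASCII superset, because pure-ASCII names then have the same bytes on both sides.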

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-02-10 06:24

Message:
Logged In: YES 
user_id=38388

Right, except that injecting Unicode into Unicode-unaware code
can be dangerous (e.g. some code might require a string object
to work on).

E.g. if someone sets the default encoding to Latin-1 he wouldn't
expect os.listdir() to suddenly return Unicode for him.

This may be a problem in general for the change to os.listdir().
We'll just have to see what happens during the alpha and beta
phases.

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2003-02-10 06:08

Message:
Logged In: YES 
user_id=92689

On the other hand, if it's not ASCII, wouldn't a unicode string be more appropriate to begin with? If it's encodable with the default encoding, this will happen as soon as the string is used in a piece of unicode-unaware code, right?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-02-10 05:55

Message:
Logged In: YES 
user_id=38388

Good question. The default encoding would better fit 
into the concept, I guess.

Instead of PyUnicode_AsASCIIString(v) you'd
have to use PyUnicode_AsEncodedString(v, NULL, "strict").

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2003-02-10 05:49

Message:
Logged In: YES 
user_id=92689

Ok, I went for your original suggestion: always convert to unicode and then try to convert to ascii. See new patch. Or should this use the default encoding? Hm.
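For readers without the patch at hand, the approach can be sketched roughly like this. This is not the patch itself (which is C code in the os module); it is a Python 3 approximation with bytes as the 8-bit string, and the function name narrow_name and its parameters are invented for illustration.

```python
def narrow_name(raw_name, fs_encoding, narrow_encoding="ascii"):
    """Decode a raw directory entry, then narrow it back if possible.

    raw_name:        bytes as read from the directory.
    fs_encoding:     stands in for Py_FileSystemDefaultEncoding.
    narrow_encoding: "ascii" as described above; passing another codec
                     models the "default encoding" alternative.
    """
    text = raw_name.decode(fs_encoding)
    try:
        # Representable in the narrow encoding: return an 8-bit string.
        return text.encode(narrow_encoding)
    except UnicodeEncodeError:
        # Otherwise keep the unicode string.
        return text
```

With these assumptions, narrow_name(b"readme.txt", "utf-8") stays an 8-bit string, while a name containing é comes back as unicode.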

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-02-10 05:17

Message:
Logged In: YES 
user_id=38388

The file system does not need to support embedded \0 chars even if it
supports UTF-16. It just happens that your test assumes a
one-byte-per-character encoding, which may not always be true. With
UTF-16 your test will see lots of \0 bytes but not necessarily ones
for which ord(x)>=128.
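A small sketch of how a byte-wise test can be fooled by UTF-16, assuming Python 3. The function looks_ascii_bytewise is a guess at the kind of test being discussed (no byte with a value of 128 or more), not the code from the patch, and the sample character is picked for illustration.

```python
def looks_ascii_bytewise(raw_name):
    """The simple test: no byte has a value of 128 or more."""
    return all(byte < 128 for byte in raw_name)


# U+0100 (LATIN CAPITAL LETTER A WITH MACRON) is not an ASCII character.
name = "\u0100"
raw = name.encode("utf-16-le")          # two bytes, both below 128

# Character-level truth versus the byte-level guess.
really_ascii = all(ord(ch) < 128 for ch in name)
guessed_ascii = looks_ascii_bytewise(raw)

# Even a plain ASCII letter picks up a \0 byte in UTF-16.
ascii_letter_raw = "a".encode("utf-16-le")
```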

I'm not sure whether other variable length encodings can result
in \0 bytes, e.g. the Asian ones. 

There's also the possibility of the
encoding mapping the ASCII range to other non-ASCII characters,
e.g. ShiftJIS does this for the Yen sign.

If you absolutely want to use the simple test, I'd at least restrict
the test to an ASCII isalnum(x) test and then try the encode/decode
method I described if this test fails.

Note that isalnum() can be locale dependent on some platforms, so you
have to hard-code it.
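One way to read this suggestion is as a fast path followed by a fallback, sketched below under the assumption of Python 3. The names ASCII_ALNUM and is_plain_ascii are invented here, and the character table is spelled out by hand so that no locale-dependent classification is involved.

```python
# Hard-coded ASCII letters and digits, independent of any locale.
ASCII_ALNUM = frozenset(
    b"abcdefghijklmnopqrstuvwxyz"
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"0123456789"
)


def is_plain_ascii(raw_name, fs_encoding):
    """Return True if raw_name is judged to be pure ASCII text."""
    # Fast path: every byte is a hard-coded ASCII letter or digit.
    # Caveat: this shortcut is only sound when fs_encoding is an ASCII
    # superset; in UTF-16 two such bytes can form one non-ASCII character.
    if all(byte in ASCII_ALNUM for byte in raw_name):
        return True
    # Slow path: decode, then try to encode back to ASCII.
    try:
        raw_name.decode(fs_encoding).encode("ascii")
    except UnicodeError:
        return False
    return True
```

Names with dots or underscores fall through to the slow path, so the shortcut only helps for a subset of names.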

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2003-02-10 04:51

Message:
Logged In: YES 
user_id=92689

I don't see how UTF-16 could be a valid value for Py_FileSystemDefaultEncoding, as on most platforms the file name can't contain null bytes. From looking at the NAMELEN() spaghetti, it seems platforms without HAVE_DIRENT_H might still support embedded null bytes. Any wisdom on this?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-02-10 04:24

Message:
Logged In: YES 
user_id=38388

Your test will probably catch most cases, but it could fail for e.g.
UTF-16.

The only true test would be to first convert to Unicode and then try to
convert back to ASCII. If you get an error you can be sure that the
text is not ASCII compatible. Given that .listdir() involves lots of IO
I think the added performance hit wouldn't be noticeable.

----------------------------------------------------------------------

Comment By: Just van Rossum (jvr)
Date: 2003-02-10 04:12

Message:
Logged In: YES 
user_id=92689

Applied both suggestions.

However, I'm not sure if my ASCII test does the right thing, or at least I don't think it does if Py_FileSystemDefaultEncoding is not a superset of ASCII.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2003-02-09 22:07

Message:
Logged In: YES 
user_id=33168

The code which uses unicode APIs should probably be wrapped 
with:

#ifdef Py_USING_UNICODE
 /* code */
#endif

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2003-02-09 20:16

Message:
Logged In: YES 
user_id=6380

At the very least, I'd like it to return Unicode only when
the original string isn't just ASCII.

----------------------------------------------------------------------
