[Python-Dev] Adding the 'path' module (was Re: Some RFE for review)

Neil Hodgson nyamatongwe at gmail.com
Wed Jul 13 02:57:49 CEST 2005


   Hi Marc-Andre,

> >    With the proposed modification, sys.argv[1] u'\u20ac.txt' is
> > converted through cp1251
> 
> Actually, it is not: if you pass in a Unicode argument to
> one of the file I/O functions and the OS supports Unicode
> directly or at least provides the notion of a file system
> encoding, then the file I/O should use the Unicode APIs
> of the OS or convert the Unicode argument to the file system
> encoding. AFAIK, this is how posixmodule.c already works
> (more or less).

   Yes it is. The initial stage is reading the command line arguments.
The proposed modification is to change behaviour when constructing
sys.argv, os.environ or when calling os.listdir to "Return unicode
when the text can not be represented in Python's default encoding". I
take this to mean that when the value can be represented in Python's
default encoding then it is returned as a byte string in the default
encoding.

   Therefore, for the example, the code that sets up sys.argv has to
encode the unicode command line argument into cp1251.
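
   A rough sketch of how I read that rule, written in Python 3 syntax
so that byte strings and unicode are distinct types. The function name
argv_value and its default_encoding parameter are made up for
illustration; this is not the interpreter's actual start-up code.

```python
def argv_value(text, default_encoding):
    """Hypothetical model of the proposed rule for sys.argv entries.

    Return a byte string when the text fits the default encoding,
    otherwise return the unicode text unchanged.
    """
    try:
        return text.encode(default_encoding)
    except UnicodeEncodeError:
        return text


euro_name = "\u20ac.txt"

# With cp1251 as the default encoding the euro sign is representable,
# so the argument comes back as a byte string.
as_cp1251 = argv_value(euro_name, "cp1251")

# With ASCII as the default encoding it is not, so it stays unicode.
as_ascii = argv_value(euro_name, "ascii")
```

   The point is that the type of sys.argv[1] would depend on the
default encoding in effect, not only on the argument itself.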

> On input, file I/O APIs should accept both strings using
> the default encoding and Unicode. How these inputs are then
> converted to suit the OS is up to the OS abstraction layer, e.g.
> posixmodule.c.

   This looks to me to be insufficiently compatible with current
behaviour, which accepts byte strings outside the default encoding.
Existing code may call open("€.txt"). This is perfectly legitimate
current Python (with a coding declaration) as "€.txt" is a byte string
and file systems will accept byte string names. Since the standard
default encoding is ASCII, should such code raise UnicodeDecodeError?
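
   To make the question concrete, here is a small sketch in Python 3
syntax. The accept_filename helper is hypothetical and only models the
"byte strings are text in the default encoding" assumption; it is not
what posixmodule.c does. It assumes the source file is saved as UTF-8,
so the literal "€.txt" holds the bytes E2 82 AC followed by ".txt".

```python
def accept_filename(name, default_encoding="ascii"):
    """Hypothetical input rule: byte-string names are assumed to be
    text in the default encoding and are decoded to unicode."""
    if isinstance(name, bytes):
        return name.decode(default_encoding)
    return name


# The bytes a Python 2 "€.txt" literal would hold in a UTF-8 source file.
utf8_name = "\u20ac.txt".encode("utf-8")

try:
    accept_filename(utf8_name)
    outcome = "accepted"
except UnicodeDecodeError:
    # The name is a valid byte string but is not valid ASCII.
    outcome = "UnicodeDecodeError"
```

   Under that rule a name the file system would accept today is
rejected before it ever reaches the OS.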

> Changing this is easy, though: instead of using the "et"
> getargs format specifier, you'd have to use "es". The latter
> recodes strings based on the default encoding assumption to
> whatever other encoding you specify.

   Don't you want to convert these into unicode rather than another
byte string encoding? It looks to me as though the "es" format always
produces byte strings and the only byte string format that can be
passed to the operating system is the file system encoding which may
not contain all the characters in the default encoding.
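
   A sketch of the problem as I understand it, again in Python 3
syntax. The recode function only imitates the recoding step described
for "es" (decode with the default encoding, encode to the target); it
is not the C API. The choice of UTF-8 as default encoding and cp1251
as file system encoding is an assumption made for the example.

```python
def recode(byte_name, default_encoding, fs_encoding):
    """Hypothetical model of recoding a byte string from the default
    encoding into the file system encoding."""
    text = byte_name.decode(default_encoding)
    return text.encode(fs_encoding)


# The euro sign exists in both encodings, so recoding succeeds.
euro_fs = recode("\u20ac.txt".encode("utf-8"), "utf-8", "cp1251")

# U+4E2D is valid in the default encoding but has no cp1251 form,
# so no byte string in the file system encoding can name the file.
try:
    recode("\u4e2d.txt".encode("utf-8"), "utf-8", "cp1251")
    cjk_outcome = "recoded"
except UnicodeEncodeError:
    cjk_outcome = "UnicodeEncodeError"
```

   Passing unicode through to the OS would avoid the second failure
on platforms with unicode file APIs.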

   Neil

