[Python-Dev] Adding the 'path' module (was Re: Some RFE for review)

M.-A. Lemburg mal at egenix.com
Tue Jul 12 10:37:14 CEST 2005


Hi Neil,

>>>2) Return unicode when the text can not be represented in ASCII. This
>>>will cause a change of behaviour for existing code which deals with
>>>non-ASCII data.
>>
>>+1 on this one (s/ASCII/Python's default encoding).
> 
> 
>    I assume you mean the result of sys.getdefaultencoding() here.

Yes.

The default encoding is the encoding that Python assumes when
auto-converting a string to Unicode. It is normally set to ASCII,
but a user may want to use a different encoding.

However, we've always made it very clear that the user is on his
own when changing the ASCII default to something else.
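For illustration, here is that coercion rule spelled out explicitly
(modern Python 3 syntax; the implicit str/unicode coercion is gone
today, but sys.getdefaultencoding() remains, and the `default`
parameter below stands in for the interpreter-wide setting):

```python
import sys

def coerce_to_unicode(raw, default=None):
    # Python 2 performed this decode implicitly whenever a byte
    # string met a Unicode string; here the rule is spelled out.
    return raw.decode(default or sys.getdefaultencoding())

coerce_to_unicode(b"abc")                      # u'abc'
coerce_to_unicode(b"\x88", default="cp1251")   # u'\u20ac' (Euro)
```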

> Unless much of the Python library is modified to use the default
> encoding, this will break. The problem is that different implicit
> encodings are being used for reading data and for accessing files.
> When calling a function, such as open, with a byte string, Python
> passes that byte string through to Windows which interprets it as
> being encoded in CP_ACP. When this differs from
> sys.getdefaultencoding() there will be a mismatch.

As I said: code pages are evil :-)

>    Say I have been working on a machine set up for Australian English
> (or other Western European locale) but am working with Russian data so
> have set Python's default encoding to cp1251. With this simple script,
> g.py:
> 
> import sys
> print file(sys.argv[1]).read()
> 
>    I process a file called '€.txt' with contents "European Euro" to produce
> 
> C:\zed>python_d g.py €.txt
> European Euro
> 
>    With the proposed modification, sys.argv[1] u'\u20ac.txt' is
> converted through cp1251 

Actually, it is not: if you pass in a Unicode argument to
one of the file I/O functions and the OS supports Unicode
directly or at least provides the notion of a file system
encoding, then the file I/O should use the Unicode APIs
of the OS or convert the Unicode argument to the file system
encoding. AFAIK, this is how posixmodule.c already works
(more or less).
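Python 3 later formalized exactly this boundary conversion as
os.fsencode()/os.fsdecode(); shown here only as a modern analogue
of what the OS abstraction layer does with Unicode arguments:

```python
import os
import sys

# The abstraction layer converts Unicode arguments to the file
# system encoding (and back) at the OS boundary.
path = "\u20ac.txt"                      # Unicode filename
raw = os.fsencode(path)                  # bytes for the OS calls
print(sys.getfilesystemencoding(), raw)
assert os.fsdecode(raw) == path          # lossless round-trip
```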

I was suggesting that OS filename output APIs such as os.listdir()
should return strings if the filename can be represented in the
default encoding, and Unicode if not.

On input, file I/O APIs should accept both strings using
the default encoding and Unicode. How these inputs are then
converted to suit the OS is up to the OS abstraction layer, e.g.
posixmodule.c.
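A sketch of that proposed policy (a hypothetical helper, not a real
API, in modern Python 3 syntax: bytes stands in for a 2.x byte
string, and "ascii" stands in for the era's default encoding):

```python
def present_filename(name, default_encoding="ascii"):
    # Proposed os.listdir() behaviour, sketched: names that fit
    # the default encoding come back as (byte) strings, anything
    # else stays Unicode.
    try:
        return name.encode(default_encoding)
    except UnicodeEncodeError:
        return name

present_filename("readme.txt")   # b'readme.txt' - plain string
present_filename("\u20ac.txt")   # u'\u20ac.txt' - stays Unicode
```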

Note that the posixmodule currently does not recode string
arguments: it simply passes them to the OS as-is, assuming
that they are already encoded using the file system encoding.
Changing this is easy, though: instead of using the "et"
getargs format specifier, you'd have to use "es". The latter
recodes strings based on the default encoding assumption to
whatever other encoding you specify.
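In Python terms, the difference between the two converters is
roughly this (a sketch of the behaviour described above, not the
actual C code in getargs.c):

```python
def convert_et(arg):
    # "et": pass byte strings through untouched, assuming they
    # are already in the file system encoding.
    return arg

def convert_es(arg, target, default="ascii"):
    # "es": decode byte strings via the default encoding first,
    # then re-encode to the requested target encoding.
    if isinstance(arg, bytes):
        arg = arg.decode(default)
    return arg.encode(target)

convert_et(b"\x88.txt")                    # b'\x88.txt', unchanged
convert_es("\u20ac.txt", target="cp1251")  # b'\x88.txt'
```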

> to '\x88.txt' as the Euro is located at 0x88
> in CP1251. The operating system is then asked to open '\x88.txt' which
> it interprets through CP_ACP to be u'\u02c6.txt' ('ˆ.txt') which then
> fails. If you are very unlucky there will be a file called 'ˆ.txt' so
> the call will succeed and produce bad data.
> 
>    Simulating with str(sys.argvu[1]):
> 
> C:\zed>python_d g.py €.txt
> Traceback (most recent call last):
>   File "g.py", line 2, in ?
>     print file(str(sys.argvu[1])).read()
> IOError: [Errno 2] No such file or directory: '\x88.txt'

See above: this is what I'd consider a bug in posixmodule.c.
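The failure Neil describes can be reconstructed with explicit
codecs (modern Python 3 syntax; cp1252 plays the role of CP_ACP
for a Western European locale):

```python
euro_name = "\u20ac.txt"                 # u'€.txt'

# Encoding through the user's default (cp1251) puts the Euro
# at position 0x88.
as_cp1251 = euro_name.encode("cp1251")
assert as_cp1251 == b"\x88.txt"

# Windows reads those bytes back through CP_ACP (cp1252 here),
# where 0x88 is U+02C6 - a different filename entirely.
seen_by_os = as_cp1251.decode("cp1252")
assert seen_by_os == "\u02c6.txt"        # 'ˆ.txt', not '€.txt'
```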

>>-1: code pages are evil and the reason why Unicode was invented
>>in the first place. This would be a step back in history.
> 
> 
>    Features used to specify files (sys.argv, os.environ, ...) should
> match functions used to open and perform other operations with files
> as they do currently. This means their encodings should match.

Right. However, most of these APIs currently either make no
assumptions about a string's contents and simply pass it
through, or they assume that the string uses the file
system encoding - which, as in the example you gave above,
can differ from the default encoding.

To cut this Gordian knot, we should use strings and Unicode
like they are supposed to be used (in the context of text
data):

* strings are fine for text data that is encoded using
  the default encoding

* Unicode should be used for all text data that is not
  or cannot be encoded in the default encoding

Later on in Py3k, all text data should be stored in Unicode
and all binary data in some new binary type.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jul 12 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

