[Python-Dev] Adding the 'path' module (was Re: Some RFE for review)

Neil Hodgson nyamatongwe at gmail.com
Tue Jul 12 08:53:24 CEST 2005


M.-A. Lemburg:

> > 2) Return unicode when the text can not be represented in ASCII. This
> > will cause a change of behaviour for existing code which deals with
> > non-ASCII data.
> 
> +1 on this one (s/ASCII/Python's default encoding).

   I assume you mean the result of sys.getdefaultencoding() here.
Unless much of the Python library is modified to use the default
encoding, this will break. The problem is that different implicit
encodings are being used for reading data and for accessing files.
When calling a function, such as open, with a byte string, Python
passes that byte string through to Windows which interprets it as
being encoded in CP_ACP. When this differs from
sys.getdefaultencoding() there will be a mismatch.

   Say I have been working on a machine set up for Australian English
(or another Western European locale) but am working with Russian data,
so I have set Python's default encoding to cp1251. With this simple
script, g.py:

import sys
print file(sys.argv[1]).read()

   I process a file called '€.txt' with contents "European Euro" to produce

C:\zed>python_d g.py €.txt
European Euro

   With the proposed modification, sys.argv[1], u'\u20ac.txt', is
converted through cp1251 to '\x88.txt', as the Euro sign is located at
0x88 in CP1251. The operating system is then asked to open '\x88.txt',
which it interprets through CP_ACP as u'\u02c6.txt' ('ˆ.txt'), so the
call fails. If you are very unlucky there will be a file called 'ˆ.txt',
so the call will succeed and produce bad data.
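
   The byte values involved can be checked on any platform with a
rough sketch like the following. It is not part of the test run on my
machine: it names both code pages explicitly, assuming cp1252 is the
CP_ACP of a Western European Windows setup and cp1251 is the modified
Python default encoding.

```python
# Sketch only: cp1251 stands in for the changed Python default encoding,
# cp1252 for CP_ACP on a Western European Windows machine (an assumption).
name = u'\u20ac.txt'

# Step 1: the proposed change would encode the argument with the
# default encoding before handing it to the file functions.
as_bytes = name.encode('cp1251')

# Step 2: Windows interprets those same bytes through the ANSI code page.
seen_by_os = as_bytes.decode('cp1252')

print(repr(as_bytes))
print(repr(seen_by_os))
print(seen_by_os == name)  # False: the OS looks for a different file name
```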

   Simulating with str(sys.argvu[1]):

C:\zed>python_d g.py €.txt
Traceback (most recent call last):
  File "g.py", line 2, in ?
    print file(str(sys.argvu[1])).read()
IOError: [Errno 2] No such file or directory: '\x88.txt'

> -1: code pages are evil and the reason why Unicode was invented
> in the first place. This would be a step back in history.

   Features used to specify files (sys.argv, os.environ, ...) should
match functions used to open and perform other operations with files
as they do currently. This means their encodings should match.
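
   One way to picture the requirement is a small check of whether a
name survives being encoded by one side and decoded by the other. This
is only an illustration: survives_round_trip and its parameter names
are made up for the sketch and are not part of any Python API.

```python
def survives_round_trip(name, argv_encoding, file_api_encoding):
    """Return True if name is unchanged after being encoded with
    argv_encoding and decoded with file_api_encoding.

    Both parameter names are hypothetical, chosen for this sketch.
    """
    try:
        encoded = name.encode(argv_encoding)
        return encoded.decode(file_api_encoding) == name
    except UnicodeError:
        # The name cannot be represented in argv_encoding, or the
        # resulting bytes are not valid in file_api_encoding.
        return False


euro = u'\u20ac.txt'
print(survives_round_trip(euro, 'cp1252', 'cp1252'))  # matching encodings
print(survives_round_trip(euro, 'cp1251', 'cp1252'))  # mismatched encodings
```

   When both sides use the same encoding the name comes back intact;
with cp1251 on one side and cp1252 on the other it does not.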

   Neil

