[Python-Dev] Adding the 'path' module (was Re: Some RFE for review)

M.-A. Lemburg mal at egenix.com
Thu Jul 14 13:04:46 CEST 2005


Hi Neil,

>>>   With the proposed modification, sys.argv[1] u'\u20ac.txt' is
>>>converted through cp1251
>>
>>Actually, it is not: if you pass in a Unicode argument to
>>one of the file I/O functions and the OS supports Unicode
>>directly or at least provides the notion of a file system
>>encoding, then the file I/O should use the Unicode APIs
>>of the OS or convert the Unicode argument to the file system
>>encoding. AFAIK, this is how posixmodule.c already works
>>(more or less).
> 
> 
>    Yes it is. The initial stage is reading the command line arguments.
> The proposed modification is to change behaviour when constructing
> sys.argv, os.environ or when calling os.listdir to "Return unicode
> when the text can not be represented in Python's default encoding". I
> take this to mean that when the value can be represented in Python's
> default encoding then it is returned as a byte string in the default
> encoding.
> 
>    Therefore, for the example, the code that sets up sys.argv has to
> encode the unicode command line argument into cp1251.

Ok, I missed your point about sys.argv *not* returning Unicode
in this particular case.
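
To make the sys.argv rule under discussion concrete, here is a minimal sketch in Python 3 terms, where bytes stands in for the 2.x byte string and str for Unicode. The function name argv_value is made up for this illustration and is not an existing API; it only models the "return unicode when the text cannot be represented in the default encoding" rule as I read it.

```python
def argv_value(text, default_encoding):
    """Hypothetical model of the proposed sys.argv behaviour.

    Returns a byte string when the text fits the default encoding,
    otherwise returns the Unicode text unchanged.
    """
    try:
        return text.encode(default_encoding)
    except UnicodeEncodeError:
        # Not representable in the default encoding: keep it as Unicode.
        return text


# With cp1251 as the default encoding the euro sign is representable,
# so a byte string comes back; with ASCII the value stays Unicode.
print(repr(argv_value(u"\u20ac.txt", "cp1251")))
print(repr(argv_value(u"\u20ac.txt", "ascii")))
```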

However, with the modification of having posixmodule
and fileobject recode string input via Unicode (based on the
default encoding) into the file system encoding by basically
just changing the parser marker from "et" to "es", you
get correct behaviour - even in the above case.

Both posixmodule and fileobject would then take the cp1251
default encoded string, convert it to Unicode and then
to the file system encoding before opening the file.
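
As a rough illustration of that recoding chain, again in Python 3 terms (bytes for the 2.x byte string, str for Unicode): the helper recode_filename and its parameters are hypothetical and not part of posixmodule or fileobject, where the real change would happen in the C-level argument parsing.

```python
def recode_filename(name, default_encoding, fs_encoding):
    """Hypothetical model of the recoding step described above.

    Byte strings are assumed to be in the default encoding and are
    decoded to Unicode first; the result is then encoded to the
    file system encoding.
    """
    if isinstance(name, bytes):
        # Byte string input: interpret it using the default encoding.
        name = name.decode(default_encoding)
    # Unicode input (or the decoded byte string) goes to the
    # file system encoding.
    return name.encode(fs_encoding)


# A cp1251-encoded name for u'\u20ac.txt', recoded for a system whose
# file system encoding is assumed to be UTF-8.
raw = u"\u20ac.txt".encode("cp1251")
print(repr(recode_filename(raw, "cp1251", "utf-8")))
```

A byte string that is not valid in the default encoding raises UnicodeDecodeError in this sketch, which is the more restrictive behaviour discussed further down.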

>>On input, file I/O APIs should accept both strings using
>>the default encoding and Unicode. How these inputs are then
>>converted to suit the OS is up to the OS abstraction layer, e.g.
>>posixmodule.c.
> 
> 
>    This looks to me to be insufficiently compatible with current
> behaviour which accepts byte strings outside the default encoding.
> Existing code may call open("€.txt"). This is perfectly legitimate
> current Python (with a coding declaration) as "€.txt" is a byte string
> and file systems will accept byte string names. Since the standard
> default encoding is ASCII, should such code raise UnicodeDecodeError?

Yes.

The above proposed change is indeed more restrictive than
the current pass-through approach. I'm not sure whether we
can impose such a change on the users in the 2.x series...
perhaps we should have a two phase approach:

Phase 1:
   try "es" and if this fails with a UnicodeDecodeError,
   revert back to the old "et" pass-through approach, issuing
   a warning as a non-disruptive signal to the user

Phase 2:
   move to "es" for good and issue decode errors
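
A sketch of what phase 1 could look like, written in Python 3 terms (bytes for the 2.x byte string, str for Unicode). The helper name phase1_filename, its default arguments and the choice of UnicodeWarning are assumptions for illustration only; the actual change would be made in the C argument parsing of posixmodule and fileobject.

```python
import warnings


def phase1_filename(name, default_encoding="ascii", fs_encoding="utf-8"):
    """Hypothetical phase 1 behaviour: recode if possible, otherwise
    warn and pass the byte string through unchanged."""
    if not isinstance(name, bytes):
        # Unicode input is encoded straight to the file system encoding.
        return name.encode(fs_encoding)
    try:
        # Recode: default encoding -> Unicode -> file system encoding.
        return name.decode(default_encoding).encode(fs_encoding)
    except UnicodeDecodeError:
        # Fall back to the old pass-through behaviour, with a warning.
        warnings.warn(
            "file name is not valid in the default encoding; "
            "passing the bytes through unchanged",
            UnicodeWarning,
            stacklevel=2,
        )
        return name
```

Phase 2 would then amount to dropping the except branch, so that the UnicodeDecodeError reaches the caller.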

>>Changing this is easy, though: instead of using the "et"
>>getargs format specifier, you'd have to use "es". The latter
>>recodes strings based on the default encoding assumption to
>>whatever other encoding you specify.
> 
>    Don't you want to convert these into unicode rather than another
> byte string encoding? It looks to me as though the "es" format always
> produces byte strings and the only byte string format that can be
> passed to the operating system is the file system encoding which may
> not contain all the characters in the default encoding.

If the OS supports Unicode directly, we can (and do) have a
special case that bypasses the recoding altogether. However,
this currently only appears to be available on Windows
versions NT, XP and up, where we already support this.

-- 
Marc-Andre Lemburg
eGenix.com


