[Python-Dev] Python-3.0, unicode, and os.environ

Toshio Kuratomi a.badger at gmail.com
Fri Dec 5 18:37:14 CET 2008


Victor Stinner wrote:
> Hi,
> 
> On Thursday 04 December 2008 21:02:19, Toshio Kuratomi wrote:
> 
>> These mixed encodings can occur for a variety of reasons.  Here's an
>> example that isn't too contrived :-)
>> (...)
>> Furthermore, they don't want to suffer from the space loss of using 
>> utf-8 to encode Japanese so they use shift-jis everywhere.
> 
> "space loss"? Really? If you configure your server correctly, you should get 
> UTF-8 even if the file system is Shift-JIS. But it would be much easier to 
> use UTF-8 everywhere.
> 
> Hum... I don't think that the discussion is about one specific server, but the 
> lack of bytes environment variables in Python3 :-)
>
Yep.  I can't change the logicalness of the policies of a different
organization, only code my application to deal with it :-)

>> 1) return mixed unicode and byte types in ...
> 
> NO!
> 
It's nice that we agree... but I would prefer if you leave enough
context so that others can see that we agree as well :-)

>> 2) return only byte types in os.environ
> 
> Hum... Most users have UTF-8 everywhere (eg. all Windows users ;-)), and 
> Python3 already uses Unicode everywhere (input(), open(), filenames, ...).
>
We're also in agreement here.

>> 3) silently ignore non-decodable value when accessing os.environ['PATH']
>> as we do now but allow access to the full information via
>> os.environ[b'PATH'] and os.getenvb()
> 
> I don't like os.environ[b'PATH']. I prefer to always get the same result 
> type... But os.listdir() doesn't respect that :-(
> 
>    os.listdir(str) -> list of str
>    os.listdir(bytes) -> list of bytes
> 
> I would prefer a similar API for easier migration from Python2/Python3
> (unicode). os.environb sounds like the best choice for me.
> 
<nod>.  After thinking about how it would be used in subprocess calls I
agree.  os.environb would allow us to retrieve the full dict as bytes.
os.environ[b''] only works on individual keys.  Also os.getenv serves
the same purpose as os.environ[b''] would whereas os.environb would have
its own uses.
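
A minimal sketch of the subprocess case follows.  It assumes a POSIX
system and an interpreter that provides an os.environb mapping of bytes
to bytes (Python 3.2 and later on POSIX provide one; it is only a
proposal in this thread).  run_child_with_bytes_env and MYVAR are
hypothetical names used for illustration.

```python
import os
import subprocess
import sys


def run_child_with_bytes_env(name, value):
    """Run a child interpreter with one extra bytes variable; return its stdout.

    Hypothetical helper. Assumes POSIX and an interpreter that has os.environb.
    """
    # Copy the whole environment as bytes so undecodable entries survive.
    env = dict(os.environb)
    env[name] = value
    # The child echoes the raw bytes of the variable back on stdout.
    child_code = (
        "import os, sys\n"
        "sys.stdout.buffer.write(os.environb[sys.argv[1].encode('ascii')])\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", child_code, name.decode("ascii")],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

The point of the design is that the full dict is copied in one step,
which a per-key lookup such as os.environ[b''] could not do.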

> 
> But there are open questions (already asked in the bug tracker):
> 
I answered these in the bug tracker.  Here are the answers for the
mailing list:

> (a) Should os.environ be updated if os.environb is changed? If yes, how?
>    os.environb['PATH'] = '\xff' (or any invalid string in the system 
>                                  default encoding)
>    => os.environ['PATH'] = ???
> 
The underlying environment that both variables reflect should be updated
but what is displayed by os.environ should continue to follow the same
rules.  So if we follow option #3::

     os.environb['PATH'] = b'\xff'
     os.environ['PATH']  => raises KeyError because PATH is not a key
                            in the unicode decoded environment.

(option #4 would issue a UnicodeDecodeError instead of a KeyError)

Similarly, if you start with a variable in os.environb that can only be
represented as bytes and your program transforms it into something that
is decodable it should then show up in os.environ.
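
The read side of this can be modelled with a toy class.  This is only a
sketch: StrEnvView, raw_env, environ3 and environ4 are hypothetical
names, a plain dict stands in for the underlying environment, and UTF-8
is assumed as the locale encoding.

```python
class StrEnvView:
    """Toy str view over a bytes environment dict (hypothetical model).

    strict=False models option #3: undecodable values look like missing keys.
    strict=True models option #4: undecodable values raise UnicodeDecodeError.
    """

    def __init__(self, raw, encoding="utf-8", strict=False):
        self.raw = raw            # shared bytes -> bytes dict, the "real" environment
        self.encoding = encoding
        self.strict = strict

    def __getitem__(self, key):
        value = self.raw[key.encode(self.encoding)]  # KeyError if really absent
        try:
            return value.decode(self.encoding)
        except UnicodeDecodeError:
            if self.strict:
                raise                      # option #4
            raise KeyError(key) from None  # option #3


raw_env = {b"PATH": b"\xff"}                 # stands in for os.environb
environ3 = StrEnvView(raw_env)               # option #3 view
environ4 = StrEnvView(raw_env, strict=True)  # option #4 view
```

Because both views read from the same dict, replacing the value with
something decodable makes it visible through the str view again without
any extra bookkeeping.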

> (b) Should os.environb be updated if os.environ is changed? If yes, how?
> 
> The problem comes with non-Unicode locales (eg. latin-1 or ASCII): most charsets 
> are unable to encode the whole Unicode charset (eg. codes >= 65535).
> 
>    os.environ['PATH'] = chr(0x10000)
>    => os.environb['PATH'] = ???
>
Ah, this is a good question.  I misunderstood what you were getting at
when you posted this to the bug report.  I see several options but the
one that seems the most sane is to raise UnicodeEncodeError when setting
the value.  With that, proper code to set an environment variable might
look like this::

    $ LANG=C python3.0
    >>> import os
    >>> variable = chr(0x10000)
    >>> try:
    ...     # Unicode aware locales
    ...     os.environ['MYVAR'] = variable
    ... except UnicodeEncodeError:
    ...     # Non-Unicode locales: pick an encoding and store raw bytes
    ...     os.environb['MYVAR'] = bytes(variable, encoding='utf8')
    ...

> (c) Same question when a key is deleted (del os.environ['PATH']).
> 
Update the underlying env so both os.environ and os.environb reflect the
change.  Deleting should not hold the problems that updating does.
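
As a rough model of the write and delete side, the sketch below encodes
before storing, so a failed encode leaves the underlying environment
untouched.  set_var, del_var and raw_env are hypothetical names, a plain
dict stands in for the real environment, and "ascii" stands in for the
encoding of a LANG=C locale.

```python
def set_var(raw, key, value, encoding):
    """Encode first, then store, so a failed encode leaves raw unchanged."""
    encoded_key = key.encode(encoding)
    encoded_value = value.encode(encoding)  # may raise UnicodeEncodeError
    raw[encoded_key] = encoded_value


def del_var(raw, key, encoding):
    """Deleting only needs the key, so it cannot hit an undecodable value."""
    del raw[key.encode(encoding)]


raw_env = {b"MYVAR": b"old"}  # stands in for the underlying environment
try:
    set_var(raw_env, "MYVAR", chr(0x10000), "ascii")
except UnicodeEncodeError:
    # The caller chooses an explicit encoding and writes bytes directly.
    raw_env[b"MYVAR"] = chr(0x10000).encode("utf-8")
```

After the fallback, a str view decoding with ASCII would no longer show
MYVAR, while the bytes view still would.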

> If Python 3.1 will have os.environ and os.environb, I'm quite sure that some 
> modules will use os.environ and others will prefer os.environb. If both 
> environments are different, the two sets of modules will behave differently :-/
> 
Exactly.  So making sure they hold the same information is a priority.

> It would be maybe easier if os.environ supports bytes and unicode keys. But we 
> have to keep these assertions:
>    os.environ[bytes] -> bytes
>    os.environ[str] -> str
> 
I think the same choices have to be made here.  If LANG=C, we still have
to decide what to do when os.environ[str] is set to a non-ASCII string.

Additionally, the subprocess question makes using the key value
undesirable compared with having a separate os.environb that accesses
the same underlying data.

>> 4) raise an exception when non-decodable values are *accessed* and
>> continue as in #3.
> 
> I like os.listdir() behaviour: just *ignore* non-decodable files. If you 
> really want to access these files, use a bytes directory name ;-)
> 
Since you wrote the code for that I would hope so ;-)

Here's my problem with it, though.  With these semantics any program
that works on arbitrary files and runs on *NIX has to check
os.listdir(b'') and do the conversion manually.  The only code that
doesn't have to care is code that is working on files that the program
created and thus controls.

Since it is not obvious that this has to be done, most programs won't do
it by default.  That leaves subtle bugs in a lot of code, which
individual application authors will only discover and fix once a user
notices that something is wrong.  And since no traceback is issued, the
process of discovery and debugging will take longer.
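
For reference, the manual conversion I mean looks roughly like the
sketch below.  split_names is a hypothetical helper, and UTF-8 is only an
assumed default; a real program would have to decide which encoding
applies to the directory it is reading.

```python
import os


def split_names(raw_names, encoding="utf-8"):
    """Split bytes filenames into (decoded str names, undecodable bytes names)."""
    decoded, undecodable = [], []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Keep the raw name so the caller can report or handle it.
            undecodable.append(raw)
    return decoded, undecodable


def list_directory(path=b"."):
    """List a directory as bytes and split the result; path must be bytes."""
    return split_names(os.listdir(path))
```

Every program that handles arbitrary files would need something like
this, and nothing in the str API hints that it is necessary.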

>> I think that the ease of debugging is lost when we silently ignore an error.
> 
> Guido gave a good example. If your directory contains a non-decodable 
> filename (eg. "???.txt"): glob('*.py') will fail because of the evil 
> filename. With the current behaviour, you're unable to list all files but 
> glob('*.py') will list all Python scripts!
> 
Current behaviour is this:

os.listdir('.')   => Only decodable filenames
glob.glob('*')    => Only decodable filenames
os.listdir(b'.')  => All filenames as bytes
glob.glob(b'*')   => All filenames as bytes

I think the desired behaviour, assuming the existence of a nondecodable
file, is this:

os.listdir('.')    => traceback
glob.glob('*')     => traceback
os.listdir(b'.')   => All filenames as bytes
glob.glob(b'*')    => All filenames as bytes

Both of these approaches are internally consistent.  Why do you think
that glob.glob('*.py') is special and should not traceback?
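
The two policies can be compared side by side with a small model.  This
is a sketch only: glob_silent and glob_strict are hypothetical functions
that work on a list of bytes names rather than a real directory, and
UTF-8 is an assumed filesystem encoding.

```python
import fnmatch


def glob_silent(raw_names, pattern, encoding="utf-8"):
    """Model of the current behaviour: undecodable names are dropped."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            continue  # silently skipped, the caller never finds out
    return fnmatch.filter(names, pattern)


def glob_strict(raw_names, pattern, encoding="utf-8"):
    """Model of the behaviour I am arguing for: any undecodable name raises."""
    names = [raw.decode(encoding) for raw in raw_names]  # may raise UnicodeDecodeError
    return fnmatch.filter(names, pattern)
```

In the strict model a '*.py' pattern does raise even though the bad
name would not have matched, which is exactly the trade-off under
discussion.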

> And since Python3 is already released, it's maybe a bad idea to change the 
> behaviour (of os.environ) in Python 3.1 :-/
> 
As you've pointed out, os.environ will have to change slightly.  But
others have already said that this is on the agenda to fix in 3.1.  The
current state is just broken as the environment is currently only
partially readable from python.

>> The bug report I opened suggests creating a PEP to address this issue.
> 
> Please, try to answer to my questions about os.environ and os.environb 
> consistency.
> 
I have.  Twice now :-)

> I also like bytes environment variables. I need them for my fuzzing program. 
> The lack of bytes variables is a regression from Python2 (for my program). On 
> UNIX, filenames are bytes and the environment variables are bytes. For the 
> best interoperability, Python3 should support bytes. But the default choice 
> should always be characters (unicode) and to never mix the bytes and str 
> types ;-)
> 
I agree 100%.

* Never mixing bytes and str is a *huge* benefit of python3 over python2.
* Unicode str everywhere possible is a python3 benefit that helps to get
conversion done at the border.

I just differ in that I think lack of tracebacks when
UnicodeDecodeErrors are encountered is a wart in python3 that did not
exist in python2.

-Toshio
