[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

Martin v. Löwis report at bugs.python.org
Sun Oct 10 18:22:28 CEST 2010


Martin v. Löwis <martin at v.loewis.de> added the comment:

On 10.10.2010 17:51, STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner at haypocalc.com> added the comment:
> 
>> We run into problems because we have two inconsistent encodings,
>> ...
> 
> What? No. We have problems because we don't use the same encoding to
> decode and to encode the same data type. It's not a problem to use a
> different encoding for each data type (stdout, filenames, environment
> variables, ...).

This is exactly the very problem that we face. In particular, the
question is what encoding to use if something is *both* a filename
and an environment variable value, or both a filename and a command
line argument.
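
A minimal sketch of that conflict, assuming a file name stored as UTF-8
bytes and a locale encoding of ISO-8859-1 (the name and both encodings
are illustrative assumptions, not taken from the report):

```python
# Hypothetical raw bytes for a file named "é.txt" on a filesystem
# that stores names as UTF-8.
raw = "\u00e9.txt".encode("utf-8")

# Decode the bytes as a command line argument, using a locale
# encoding that differs from the filesystem encoding.
as_text = raw.decode("iso-8859-1")

# Encode the resulting string again as a filename, this time with
# the filesystem encoding.
back = as_text.encode("utf-8")

# The bytes no longer match, so the name would not refer to the file.
mismatch = back != raw
print(repr(as_text), raw, back, mismatch)
```

Decoding and encoding the same value with two different encodings
changes the bytes, which is why a value that is both an argument and a
filename needs a single agreed encoding.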

> Mac OS X is a special case. Filesystem encoding is utf-8 on this OS,
> whereas the locale encoding depends on LANG variable. If I understood
> MvL's proposal correctly, we should not rely on the locale on Mac OS
> X.

"Not rely on" is perhaps a bit harsh. It's not clear (to me) under what
conditions the locale's encoding will be more correct than just assuming
UTF-8 - there may actually be use cases for it.

However, with the surrogate escapes, we could just always decode using
UTF-8, and leave any mojibake problems that may arise from this
to the application. I do think that these problems will be rare,
since a) many OSX installations use UTF-8 anyway, and b) those that
don't will likely still get proper round-tripping through the escape
mechanism.
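
A short sketch of the round-tripping that the PEP 383 escape mechanism
gives, assuming a hypothetical argument whose bytes are ISO-8859-1 and
therefore not valid UTF-8:

```python
# Hypothetical argument bytes: "café.txt" encoded as ISO-8859-1.
# The byte 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9.txt"

# Decode as UTF-8 with the surrogateescape error handler: the
# undecodable byte becomes the lone surrogate U+DCE9 instead of
# raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")

print(repr(text), restored == raw)
```

The decoded string is not displayable as intended, but the original
bytes survive, so passing the value back to the OS still works.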

> So the "3rd encoding" and the filesystem encodings should be
> hardcoded to utf-8?

That's an option to consider, yes - I'd like an OSX expert to
comment.

> The "third encoding" is no longer controllable by a special environment
> variable, only by classic locale environment variables (LC_ALL,
> LC_CTYPE, LANG). Is it a problem? I remember a comment from MAL
> saying that it may be a problem for CGI for the environment variables
> because some (all?) variables are not encoded with the locale
> encoding (but the HTML encoding?). I don't know if Python should
> workaround CGI specific issues. In Python 3.2, we have now
> os.environb: it's now possible to use a different encoding for each
> variable.

I think these problems are sufficiently resolved now: either by
PEP 3333, PEP 444, PEP 383, or os.environb.

I think you misunderstood MAL's comment, though: the environment
variables are not encoded in *any* specific encoding. Instead,
they are copied literally from the HTTP request, using whatever
bytes the browser originally put in there - which may or may
not have followed a particular encoding. HTTP is silent on
this most of the time, and HTML is out of scope.
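
As a sketch of how os.environb lets an application choose an encoding
per variable: the helper decode_env_value and the sample values below
are hypothetical, and the encodings assigned to each variable are
assumptions made for illustration only.

```python
import os


def decode_env_value(environ_bytes, name, encoding):
    """Decode one variable from a bytes mapping with a caller-chosen encoding.

    Hypothetical helper: returns None if the variable is not set.
    """
    value = environ_bytes.get(name)
    if value is None:
        return None
    # surrogateescape keeps undecodable bytes recoverable (PEP 383).
    return value.decode(encoding, "surrogateescape")


# Hypothetical CGI-style data: each value carries whatever bytes the
# client sent, so the application has to pick the encoding itself.
sample = {
    b"QUERY_STRING": b"name=caf\xe9",
    b"PATH_INFO": "/caf\u00e9".encode("utf-8"),
}
query = decode_env_value(sample, b"QUERY_STRING", "iso-8859-1")
path = decode_env_value(sample, b"PATH_INFO", "utf-8")

# The same helper works on the real bytes environment where it exists.
if os.supports_bytes_environ:
    home = decode_env_value(os.environb, b"HOME", "utf-8")
```

os.environb is only available when os.supports_bytes_environ is true,
hence the guard.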

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue9992>
_______________________________________

