[Python-3000] Unicode and OS strings

James Y Knight foom at fuhm.net
Wed Sep 19 00:52:18 CEST 2007


On Sep 18, 2007, at 11:11 AM, Guido van Rossum wrote:
> If they contain
> non-ASCII bytes I am currently in favor of doing a best-effort
> decoding using the default locale encoding, replacing errors with '?'
> rather than throwing an exception.

One of the more common things to do with command line arguments is  
open them. So, it'd really be nice if:

python -c 'import sys; open(sys.argv[1])' [some filename]

would always work, regardless of the current system encoding and what  
characters make up the filename.  Note that filenames are essentially  
random binary gunk in most Unix systems; the encoding is unspecified,  
and there can in fact be multiple encodings, even for different  
directories making up a single file's path.
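A minimal sketch of the problem (using modern Python 3 spelling, which postdates this thread): a perfectly legal Unix filename need not decode under any particular codec, so a strict decode of argv can blow up. The byte string below is hypothetical, just Latin-1-encoded "café.txt":

```python
# Latin-1 bytes for "café.txt" -- a valid Unix filename, but not valid UTF-8.
raw = b"caf\xe9.txt"

try:
    raw.decode("utf-8")  # strict decoding of such a name fails outright
except UnicodeDecodeError as exc:
    print("strict UTF-8 decode fails at byte offset", exc.start)
```
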

I'd like to propose that python simply assume the external world is  
likely to be UTF-8, and always decode command-line arguments (and  
environment vars), and encode for filesystem operations using the  
roundtrip-able UTF-8b. Even if the system says its encoding is  
iso-2022 or some other abomination. This has upsides (simple, doesn't  
trample on PUA codepoints, only needs one new codec, never throws  
exception in the above example, and really is correct much of the  
time), and downsides (if the system locale is iso-2022, and all the  
filenames you're dealing with really are also properly encoded in  
iso-2022, it might be nice if they decoded into the sensible unicode  
string, instead of a nonsensical (but still round-trippable) one).
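For the record, the round-trippable decoding proposed here is essentially what later landed as the "surrogateescape" error handler (PEP 383): undecodable bytes are smuggled into lone surrogate code points and restored exactly on encode. A sketch, again using Python 3 names that did not exist when this was written:

```python
# Byte 0xE9 is not valid UTF-8 here; surrogateescape maps it to U+DCE9
# instead of raising, so decoding never fails.
raw = b"caf\xe9.txt"
name = raw.decode("utf-8", "surrogateescape")
print(name)  # the bad byte survives as a lone surrogate

# Encoding with the same handler recovers the original bytes exactly,
# so filesystem operations can round-trip arbitrary argv/filename gunk.
assert name.encode("utf-8", "surrogateescape") == raw
```
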

I think the advantages outweigh the disadvantages, but the world I  
live in, using anything other than UTF8 or ASCII is grounds for entry  
into an insane asylum. ;)

James

More information about the Python-3000 mailing list