Python 3.2 has some deadly infection

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Jun 1 21:14:40 EDT 2014


On Mon, 02 Jun 2014 08:54:33 +1000, Tim Delaney wrote:

> On 1 June 2014 12:26, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
> 
> 
>> "with cross-platform behavior preferred over system-dependent one" --
>> It's not clear how cross-platform behaviour has anything to do with the
>> Internet age. Python has preferred cross-platform behaviour forever,
>> except for those features and modules which are explicitly intended to
>> be interfaces to system-dependent features. (E.g. a lot of functions in
>> the os module are thin wrappers around OS features. Hence the name of
>> the module.)
>>
>>
> There is the behaviour of defaulting input and output to the system
> encoding. 

That's a tricky one, but I think on balance that is a case where 
defaulting to the system encoding is the right thing to do. Input and 
output occur on the local system you are running on, which by definition isn't 
cross-platform. (Non-local I/O is possible, but requires work -- it 
doesn't just happen.)
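
To be concrete, "the system encoding" here means whatever the locale 
reports, which is what Python 3 falls back to for files and for stdout. 
On my UTF-8 box it looks something like this (the exact values will 
differ from machine to machine):

py> import sys, locale
py> locale.getpreferredencoding(False)  # the default used by open()
'UTF-8'
py> sys.stdout.encoding  # the encoding print() writes with
'UTF-8'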


> I personally think we would all be better off if Python (and
> Java, and many other languages) defaulted to UTF-8. This hopefully would
> eventually have the effect of producers changing to output UTF-8 by
> default, and consumers learning to manually specify an encoding when
> it's not UTF-8 (due to invalid codepoints).

UTF-8 everywhere should be our ultimate aim. Then we can forget about 
legacy encodings except when digging out ancient documents from archived 
floppy disks :-)
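
In the meantime, the practical advice is not to rely on the default at 
all: pass an explicit encoding whenever you open a file. A trivial 
sketch (the file name is just for illustration):

py> with open('data.txt', 'w', encoding='utf-8') as f:
...     f.write('δжç\n')  # written as UTF-8 no matter what the locale says
...
4
py> open('data.txt', encoding='utf-8').read()
'δжç\n'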


> I'm currently working on a product that interacts with lots of other
> products. These other products can be using any encoding - but most of
> the functions that interact with I/O assume the system default encoding
> of the machine that is collecting the data. The product has been in
> production for nearly a decade, so there's a lot of pushback against
> changes deep in the code for fear that it will break working systems.
> The fact that they are working largely by accident appears to escape
> them ...
> 
> FWIW, changing to use iso-latin-1 by default would be the most sensible
> option (effectively treating everything as bytes), with the option for
> another encoding if/when more information is known (e.g. there's often a
> call to return the encoding, and the output of that call is guaranteed
> to be ASCII).

Python 2 does what you suggest, and it is *broken*. Python 2.7 creates 
moji-bake, while Python 3 gets it right:


[steve at ando ~]$ python2.7 -c "print u'δжç'"
Î´Ð¶Ã§
[steve at ando ~]$ python3.3 -c "print(u'δжç')"
δжç


Latin-1 is one of those legacy encodings which needs to die, not to be 
entrenched as the default. My terminal uses UTF-8 by default (as it 
should), and if I use the terminal to input "δжç", Python ought to see 
what I input, not Latin-1 moji-bake.
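
If anyone wants to see exactly where that moji-bake comes from, it is 
just the UTF-8 bytes sent by the terminal being wrongly decoded as 
Latin-1. A minimal reconstruction:

py> "δжç".encode('utf-8')  # the bytes a UTF-8 terminal actually sends
b'\xce\xb4\xd0\xb6\xc3\xa7'
py> _.decode('latin-1')  # what you get if Python assumes Latin-1
'Î´Ð¶Ã§'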

If I were to use Windows with a legacy code page, then I couldn't even 
enter "δжç" on the command line since none of the legacy encodings 
support that set of characters at the same time. I don't know exactly 
what I would get if I tried (say, by copying and pasting text from a 
Unicode-aware application), but I'd see that it was weird *in the shell* 
before it even reaches Python.
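
In fact you can check that without a Windows box at hand: none of the 
usual single-byte code pages can encode all three characters. (The three 
code pages below are just examples.)

py> for codepage in 'cp1252', 'cp1251', 'cp1253':  # Western, Cyrillic, Greek
...     try:
...         "δжç".encode(codepage)
...     except UnicodeEncodeError as err:
...         print(codepage, "cannot encode", err.object[err.start])
...
cp1252 cannot encode δ
cp1251 cannot encode δ
cp1253 cannot encode ж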

On the other hand, if I were to input something supported by the legacy 
encoding, let's say I entered "αβγ" while using ISO-8859-7 (Greek), then 
Python ought to see "αβγ" and not moji-bake:

py> b = "αβγ".encode('iso-8859-7')  # what the shell generates
py> b.decode('latin-1')  # what Python interprets those bytes as
'áâã'
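
And, for completeness, the decode Python ought to be doing in that 
situation:

py> b.decode('iso-8859-7')  # decode with the encoding the shell actually used
'αβγ'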


Defaulting to the system encoding means that Python input and output just 
works, to the degree that input and output on your system just works. If 
your system is crippled by the use of a legacy encoding, then Python will 
at least be *no worse* than your system.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


