Python 3 is killing Python

Wed Jul 16 10:39:17 EDT 2014

Chris Angelico <rosuav at gmail.com>:

> On Wed, Jul 16, 2014 at 11:11 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> I would be especially wary of letting Python 3 interpret those files for
>> me. [...]
>
> If you're reading your own config files, you can simply stipulate that
> they are to be encoded UTF-8, and if they're not, you throw an error.
> Simple! Works with the easy way of opening files in Python.

That's my point! It does not work.

   $ python3 -c "
   > import sys
   > sys.stdout.write(sys.stdin.read())" <<<"Hyvää yötä"
   Hyvää yötä
   $ LANG=en_US.ASCII python3 -c "
   > import sys
   > sys.stdout.write(sys.stdin.read())" <<<"Hyvää yötä"
   Traceback (most recent call last):
     File "<string>", line 3, in <module>
     File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
       return codecs.ascii_decode(input, self.errors)[0]
   UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3\
   : ordinal not in range(128)

In other words, the well-meaning Python 3 blindly obeys the locale when
it decodes stdin, even though I "simply stipulated" that my input is
UTF-8.
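
For the stipulation to hold, it seems it has to be spelled out in code
rather than left to the default text streams. A minimal sketch, assuming
the input is read as bytes and decoded by hand; read_utf8, write_utf8
and main are hypothetical helper names, not anything from the thread:

```python
import io
import sys


def read_utf8(raw):
    """Read all bytes from a binary stream and decode them as UTF-8.

    The locale is never consulted; invalid input raises UnicodeDecodeError.
    """
    return raw.read().decode("utf-8", errors="strict")


def write_utf8(raw, text):
    """Encode text as UTF-8 and write it to a binary stream."""
    raw.write(text.encode("utf-8"))


def main():
    # sys.stdin.buffer and sys.stdout.buffer are the binary streams that
    # sit underneath the locale-decoded text streams.
    write_utf8(sys.stdout.buffer, read_utf8(sys.stdin.buffer))
```

Calling main() should copy stdin to stdout the same way under any LANG
setting, because neither direction goes through the locale's codec. That
is no longer "the easy way" of opening files, which is the point.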

>> Thing is, the serious text utilities like word processors probably
>> need lots of ancillary information so Python's [text] strings might
>> be too naive to represent even a single character.
>
> Ancillary information? (La)TeX files are entirely text,

I mean on the inside. For example, if emacs were to be written in
Python 3, I don't know if it could use Python 3's strings.

   Each character position in a buffer or a string can have a text property
   list, much like the property list of a symbol (see Property Lists). The
   properties belong to a particular character at a particular place, such
   as, the letter ‘T’ at the beginning of this sentence or the first ‘o’ in
   ‘foo’—if the same character occurs in two different places, the two
   occurrences in general have different properties.
   <URL: http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Properties.html>.
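
To illustrate what a plain string lacks here, a rough sketch of text
that carries a property list per character position. PropertiedText and
its methods are hypothetical names made up for this example, and it is
not a claim about how emacs actually stores text properties:

```python
class PropertiedText:
    """A string plus one property dict per character position (hypothetical)."""

    def __init__(self, text):
        self.text = text
        # One independent dict per position, so equal characters at
        # different positions never share properties.
        self.props = [{} for _ in text]

    def put(self, start, end, name, value):
        """Set property `name` to `value` on positions start..end-1."""
        for i in range(start, end):
            self.props[i][name] = value

    def get(self, pos, name, default=None):
        """Return property `name` at position `pos`, or `default` if unset."""
        return self.props[pos].get(name, default)


buf = PropertiedText("foo")
buf.put(1, 2, "face", "bold")  # only the first 'o' gets the property
```

Here buf.get(1, "face") and buf.get(2, "face") differ although both
positions hold the same character, which is exactly the information a
bare str cannot express.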

> What is C actually storing in that string? Do you know? Can you be
> truly sure that it's UTF-8? No, you cannot.

I happen to know it is. And again, I may "stipulate" it, to use your
word.

Python, happily, is even more explicit about it:

   #!/usr/bin/env python3
   # -*- coding: utf-8 -*-

Marko