Python 3 is killing Python

Chris Angelico rosuav at gmail.com
Wed Jul 16 10:04:41 EDT 2014


On Wed, Jul 16, 2014 at 11:11 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
>
>> With a few exceptions, /etc is filled with text files, not binary
>> files, and half the executables on the system are text (Python, Perl,
>> bash, sh, awk, etc.).
>
> Our debate seems to stem from a different idea of what text is. To me,
> text in the Python sense is a sequence of UCS-4 character code points.
> The opposite of text is not necessarily binary.

Let's shift things a moment for an analogy. What is audio? What is
sound? (Music, if you like, but I'm not going to get into the debate
of whether or not Band So-and-so's output should be called music.) I
have a variety of files that store music; some are RIFF WAVs, some are
MPEG Layer 3s, some are Ogg Vorbis files, and right now I have an MKV
of "Do you wanna build a snowman?" playing. (As far as I'm concerned,
it's primarily there for music, and the video image is buried behind
other windows. But I'll accept the argument that that's just a
container for some other format of audio, probably MPEG but I haven't
checked.) Sound, fundamentally, is a waveform, or a series of air
pressures.

Text, similarly, is not UCS-4, but a series of characters. We are
fortunate enough to have Unicode and can therefore define that text is
a sequence of Unicode codepoints, but the notion of a character isn't
something Unicode invented; if you ask a primary school child to identify the letters
in a word, s/he should be able to do so, and that without any computer
involvement at all. Letters, digits, and other characters exist
independently of encodings or even character sets, but it's really
REALLY hard for computers to manipulate what they can't identify. So
let's define Unicode text as "a sequence of Unicode codepoints" or "a
sequence of Unicode characters", and proceed from there.
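
To put that in concrete Python terms (just an illustration): a str is a
sequence of codepoints, and its length doesn't depend on whatever
encoding you might later pick for it.

    >>> s = 'λ'
    >>> len(s)                   # one character, one codepoint
    1
    >>> len(s.encode('utf-8'))   # two bytes once encoded
    2

The 2 is a property of the encoding; the 1 is a property of the text.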

A file on a Unix or Windows file system consists of a sequence of
bytes. Ergo, a file cannot actually contain text; it must store
*encoded* text. But this is far and away the most common type of file
on any file system. Tweaking the previous script to os.walk() my home
directory, rather than scanning $PATH, the ratios are roughly 2:1 the
other way - heaps more text files than binary. And this is with my
Downloads/ directory being almost entirely binaries, and lots of them;
various zip files, deb packages, executables of various types... about
the only actual text there would be .patch files.
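
(The script itself isn't quoted here, but a rough, untested sketch of
the idea - using "decodes as UTF-8" as a crude stand-in for "is text" -
would be something like:

    import os

    text = binary = 0
    for root, dirs, files in os.walk(os.path.expanduser("~")):
        for name in files:
            path = os.path.join(root, name)
            try:
                # Crude: slurp the whole file and see if it decodes.
                with open(path, "rb") as f:
                    f.read().decode("utf-8")
            except (UnicodeDecodeError, OSError):
                binary += 1
            else:
                text += 1
    print(text, "text files,", binary, "binary files")

Adjust the heuristic to taste.)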

>> Relatively rare. Like, um, email, news, html, Unix config files,
>> Windows ini files, source code in just about every language ever,
>> SMSes, XML, JSON, YAML, instant messenger apps,
>
> I would be especially wary of letting Python 3 interpret those files for
> me. Python's [text] strings could be a wonderful tool on the inside of
> my program, but I definitely would like to micromanage the I/O. Do I
> obey the locale or not? That's too big (and painful) a question for
> Python to answer on its own (and pretend like everything's under
> control).

That's a problem that will be solved progressively, by daemons
shifting to UTF-8 for everything. But until then, you have to treat
log files as "messy" - you can't trust to a simple encoding. But
that's unusual compared to the common case. If you're reading your own
config files, you can simply stipulate that they are to be encoded
UTF-8, and if they're not, you throw an error. Simple! Works with the
easy way of opening files in Python. If you're reading someone else's
config files, you can either figure out what that program is
documented as expecting (and error out if the file's misencoded), or
treat it as messy and read it as binary.
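
In code, the "stipulate UTF-8 and error out" approach is about as simple
as it gets (the config file name here is made up, but the pattern is the
point):

    import sys

    try:
        # Stipulate UTF-8; anything else is a configuration error.
        with open("myapp.conf", encoding="utf-8") as f:
            config_text = f.read()
    except UnicodeDecodeError as e:
        sys.exit("myapp.conf is not valid UTF-8: %s" % e)

No guessing, no locale consultation, no silent mojibake.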

>> word processors... even *graphic* applications invariably have a text
>> tool.
>
> Thing is, the serious text utilities like word processors probably need
> lots of ancillary information so Python's [text] strings might be too
> naive to represent even a single character.

Ancillary information? (La)TeX files are entirely text, and have all
that info in them somewhere. Open Documents are basically zip files of
XML data, where XML is ... all text. Granted, it's barely-readable
text, but it is UTF-8 encoded text. (I just checked an odt file that I
have sitting here, and it does contain a thumbnail in PNG format. But
the primary content is all XML files.)
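
If you want to poke at one yourself, the standard library will do it;
"document.odt" below is just whatever file you have handy:

    import zipfile

    with zipfile.ZipFile("document.odt") as z:
        print(z.namelist())   # mimetype, content.xml, styles.xml, ...
        # The document body itself is UTF-8 encoded XML:
        print(z.read("content.xml").decode("utf-8")[:200])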

>>> More often, len(b'λ') is what I want.
>>
>> Oh really? Are you sure? What exactly is b'λ'?
>
> That's something that ought to work in the UTF-8 paradise.
> Unfortunately, Python only allows ASCII in bytes. ASCII only! In this
> day and age! Even C is not so picky:
>
>    #include <stdio.h>
>
>    int main()
>    {
>        printf("Hyvää yötä\n");
>        return 0;
>    }

And I have a program that lets me store 1.75 in an integer variable!
That's ever so much better than most programs. It's so much less
picky!

Actually, Python allows all 256 byte values in a bytestring; it's only
the literal syntax that insists on ASCII characters (escapes like
b'\xce\xbb' cover the rest). However, b'λ' has no meaning; in fact,
even b'asdf' is dubious, and
this kind of notation exists only because there are many file formats
that mix ASCII text and binary data. To be truly accurate, b'asdf'
ought to be written as x'61736466' or something, because it's as
likely to mean 1634952294 or 1717859169 as it is to mean "asdf".
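
(Those numbers aren't pulled out of thin air - they're the same four
bytes read as big-endian and little-endian 32-bit integers:

    >>> int.from_bytes(b'asdf', 'big')
    1634952294
    >>> int.from_bytes(b'asdf', 'little')
    1717859169
    >>> b'asdf'.decode('ascii')
    'asdf'

The bytes don't know which interpretation you meant.)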

What is C actually storing in that string? Do you know? Can you be
truly sure that it's UTF-8? No, you cannot. Anyone might transcode
your source file, and I don't think C compilers pay any attention to
the encoding of string literals - the source bytes go into the binary
as-is. More importantly, you cannot
be sure that that will print "Hyvää yötä" to the console; if the
console is set to an encoding other than the one your source file was
using, you'll get mojibake. With Python, at least the interpreter gets
some idea of what's going on.
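
(The mojibake is easy to demonstrate without leaving Python: pretend the
source was UTF-8 but the console thinks it's Latin-1:

    >>> "Hyvää yötä".encode("utf-8").decode("latin-1")
    'HyvÃ¤Ã¤ yÃ¶tÃ¤'

Same bytes, wrong decoder, garbage on screen.)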

ChrisA


