Python 3.2 has some deadly infection

Marko Rauhamaa marko at pacujo.net
Thu Jun 5 12:52:22 EDT 2014


Steven D'Aprano <steve+comp.lang.python at pearwood.info>:

> Nevertheless, there are important abstractions that are written on top
> of the bytes layer, and in the Unix and Linux world, the most
> important abstraction is *text*. In the Unix world, text formats and
> text processing are much more common in user-space apps than binary
> processing.

That Linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of Unicode code points (in effect, 32-bit
integers). Linux text is a sequence of 8-bit integers.
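
To make the distinction concrete, here is a minimal Python 3 sketch. It
assumes the bytes on the Linux side are UTF-8, and the variable names
are only illustrative.

```python
# Minimal sketch: one phrase as Python text and as UTF-8 bytes.
text = "Hyvää yötä"              # Python str: a sequence of code points
data = text.encode("utf-8")      # assumed encoding of the pipe or file

code_points = [ord(ch) for ch in text]   # one integer per character
byte_values = list(data)                 # one integer (0..255) per byte

print(len(text), code_points)
print(len(data), byte_values)
```

The two lengths differ because each "ä" and "ö" is one code point but
two bytes.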

It is great that lots of computer-to-computer formats are encoded in
ASCII (~ UTF-8). However, nowhere in Linux is there a real abstraction
layer that processes Python-esque text.

Case in point:

   $ env | grep UTF
   LANG=en_US.UTF-8
   $ od -c <<<"Hyvää yötä"     # "Good night" in Finnish
   0000000   H   y   v 303 244 303 244       y 303 266   t 303 244  \n
   0000017

The "od" utility is asked to display its input as characters. The locale
info gives a hint that all text data is in UTF-8. Yet what comes out is
bytes.
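
The octal numbers are simply the UTF-8 bytes of the non-ASCII letters.
A rough Python 3 sketch of the same view follows. The helper name
od_c_cells is made up for illustration, and it handles only printable
ASCII and newline, unlike the real od.

```python
def od_c_cells(data: bytes) -> list:
    """Roughly mimic the cells of `od -c` (simplified, hypothetical helper)."""
    cells = []
    for b in data:
        if b == 0x0A:
            cells.append("\\n")              # newline shown as an escape
        elif 0x20 <= b < 0x7F:
            cells.append(chr(b))             # printable ASCII shown as-is
        else:
            cells.append(format(b, "03o"))   # everything else in octal
    return cells

data = "Hyvää yötä\n".encode("utf-8")   # the here-string adds the newline
cells = od_c_cells(data)
final_offset = format(len(data), "07o")  # byte count in octal, like od

print(" ".join(cells))
print(final_offset)
```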

How about:

   $ wc -c <<<"Hyvää yötä"
   15
   $ tr 'ä' 'a' <<<"Hyvää yötä"
   Hyvaaaa ya�taa
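
Both results fall out of byte-wise processing. The Python 3 sketch
below reproduces them under two assumptions: the input is UTF-8, and
tr pads the shorter second set by repeating its last character (so
both bytes of "ä" map to "a"). It is an analogy, not tr's actual code.

```python
text = "Hyvää yötä\n"
data = text.encode("utf-8")

byte_count = len(data)    # what `wc -c` counts
char_count = len(text)    # what a count of code points would give

# Byte-wise view of tr 'ä' 'a': "ä" is two bytes, each mapped to "a".
table = bytes.maketrans("ä".encode("utf-8"), b"aa")
mangled = data.translate(table)
shown = mangled.decode("utf-8", errors="replace")   # orphaned byte -> U+FFFD

# The same substitution done on code points.
clean = text.translate(str.maketrans("ä", "a"))

print(byte_count, char_count)
print(repr(shown))
print(repr(clean))
```

The stray byte left over from "ö" is no longer valid UTF-8, which is
why a terminal shows a replacement character there.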

Grep is smarter:

   $ grep v...y <<<"Hyvää yötä"
   Hyvää yötä

Here each "." matched a whole character, not a byte, because grep
consults the locale. That is why you should always prefix "grep" with
LC_ALL=C in your scripts (that makes it far faster, too).
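
The difference between the two modes can be imitated with Python's re
module, which matches code points on str patterns and bytes on bytes
patterns. This is only an analogy for grep in a UTF-8 locale versus
the C locale, assuming UTF-8 input.

```python
import re

text = "Hyvää yötä"
data = text.encode("utf-8")

char_match = re.search(r"v...y", text)      # "." is one code point
byte_match = re.search(rb"v...y", data)     # "." is one byte
wide_match = re.search(rb"v.....y", data)   # five bytes between v and y

print(char_match.group() if char_match else None)
print(byte_match)
print(wide_match.group() if wide_match else None)
```

On bytes, the three-dot pattern no longer matches, since "ää " occupies
five bytes.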


Marko


