Python's 8-bit cleanness deprecated?

Fri Feb 7 18:26:17 EST 2003

Simo Salminen wrote:
> * Kirill Simonov [Fri, 7 Feb 2003 18:39:56 +0200]
>>...But what is the price that we pay for this? The millions of Python
>>scripts that use 8-bit string literals or comments are broken now in
>>order to allow the feature that no one ever used! I think that this is
>>an extreme.
> ...
> This change only makes python hostile to regular programmer, who
> does not care about encodings, and only wants to use simple 8-bit
> characters in comments.

I told myself to be quiet, but ....

This change is one step on the way to switching python source
from bytes to characters; from binary source to text source.

Unix users often think there is no difference between binary and
text files: the two are different, but on unix the representation
is the same.  That is, the text file consists of characters (which
have particular meanings), while the binary files are simply byte
streams which can only be replicated.  HTML files are not the same
as text files either, although they are represented as text files.
The difference is in what you know about the contents of the file.
If I know a file is HTML, I can display it with a browser and see
nifty effects like bold, italic, type size changes, ....  Without
that information, I have less knowledge about how I might be able
to use that file.

Conceptually, source code is text, not bytes.  Nobody really cares
how the characters in a line are encoded: the meaning is apparent
by looking at displayed characters.  Unfortunately, we are now (as
we always were) in a world where there are multiple encodings for
the same characters.  On any given computer system, for a particular
user, there is a text encoding they are most comfortable using.
This preference is usually because their favorite text editor can
read and write that encoding, and it has all of the characters they
are likely to use.

There are various options for python source:

First, we could define the coding to be 'system local,' and
endure constant complaints when a file that works right on one
system (or even for one user) does not behave in the same way
for another.  This is the "plain old 8-bit" option.
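
To make the complaint concrete, here is a minimal sketch (my own
illustration, with the byte value and the three codecs picked
arbitrarily) of one 8-bit byte read under three "system local"
encodings:

```python
# One byte, as it might sit in an 8-bit string literal or a comment.
raw = b"\xe4"

# The same byte decoded under three common single-byte encodings.
as_latin1 = raw.decode("latin-1")
as_koi8r = raw.decode("koi8_r")
as_cp1251 = raw.decode("cp1251")

for name, char in [("latin-1", as_latin1),
                   ("koi8-r", as_koi8r),
                   ("cp1251", as_cp1251)]:
    # Show the character and its unicode code point.
    print(name, repr(char), hex(ord(char)))
```

The bytes in the file never change; only the reader's assumption
about them does, and with it the character that appears.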

Second, we could (as I understand Python was conceived) restrict
python to 7-bit printable ASCII plus space, horizontal tab, and
a line ending (\n, \r, or \r\n?).  By the by, if you think the last
is nit-picking, exactly which bytes are in the constant: """a
z"""?  The answer may depend on your operating system, or it may
not.  You probably can run python programs shipped (as binary
files) from Mac OS X, Microsoft Windows, and Unix systems.  The
results might differ.  Pretty much anyone who uses more characters
than are available in ASCII is going to be infuriated by this
choice.
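
As a sketch of the line-ending question only (this shows the bytes
a text editor might write for the same two lines, not what any
particular python does with them; the convention labels are mine):

```python
# The two "lines" of the constant, as characters.
lines = ["a", "z"]

# Three line-ending conventions a file might have been saved with.
conventions = {
    "LF (Unix)": "\n",
    "CRLF (Windows)": "\r\n",
    "CR (classic Mac OS)": "\r",
}

# The bytes on disk for the same two lines under each convention.
on_disk = {name: eol.join(lines).encode("ascii")
           for name, eol in conventions.items()}

for name, data in on_disk.items():
    print(name, data, len(data))
```

Same characters to the eye, three different byte strings on disk.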

Third, we could declare a single encoding as "the blessed" encoding
for python source.  This would be perfect for the winners and nasty
for the losers.  One group would love "latin1" to be the code.  Well,
UTF-8 at least has the pleasant property of being able to represent
the vast majority of characters representable on computers.  So
UTF-8 might be a good choice.  However, with a few exceptions, text
editors on a system work well with a particular local encoding.  Only
in a very few cases is this a variant of unicode.  So on many systems
people who use python will be forced to use a different text editor
than they normally use.

Fourth, we could define python in terms of characters, and allow
locally-encoded text to be used as long as we know the mapping
from the local code to some standard (say, unicode).  It is likely
that such python translators will consist of a thin sugary coating
of local-encoding-capable code over a chocolatey core of standard
python code to do the actual parsing, compiling, etc.  The core, in
order to be most portable, is likely to munch on unicode.  We'll
also need a sugary layer that knows how to determine which original
bytes of local encoding to use in such things as non-unicode string
constants (note the unicode strings will be just dandy as-is).  This
looks a _lot_ like the first case, but allows local text encoding
that doesn't map ASCII to the ASCII subset of unicode.  This is also
the first character-based option.
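
A rough sketch of that sugary coating, assuming the local encoding
is simply handed to us (the function name run_local_source and the
sample source are made up for illustration, and the handling of
non-unicode string constants is left out entirely):

```python
def run_local_source(raw, local_encoding):
    """Decode locally-encoded source to characters, then compile it."""
    # The thin local layer: bytes in the local encoding -> characters.
    text = raw.decode(local_encoding)
    # The core: it only ever sees characters, never the local bytes.
    code = compile(text, "<local source>", "exec")
    namespace = {}
    exec(code, namespace)
    return namespace


source = "name = '\u017d'\n"
# The same program text stored by two editors with different encodings.
from_cp1250 = run_local_source(source.encode("cp1250"), "cp1250")
from_latin2 = run_local_source(source.encode("iso8859_2"), "iso8859_2")
print(from_cp1250["name"] == from_latin2["name"])
```

The two files differ as bytes, but once the mapping to unicode is
known the core sees the same characters either way.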

Fifth, and I personally choose to take the fifth, we could use the
fourth option, except we could write the encoding at the front of
the file (oops, unix uses the first line to control certain program
behavior, let's allow the first _or_ the second line).  If this works,
not only does it work as well as the fourth option, but we can
actually use modules developed under another encoding on our system
without ever having to push them through some sort of "try to do
what they mean" translator to get it into our local format.  This
option uses character-based source code with explicit encoding to
allow us to run python from anywhere locally.  _But_, it requires
we be explicit about encodings.  Our translator will cope properly
with a program built from modules in different encodings, _but_it_
_must_know_the_encodings_.  This is delightful, since now we can
have a single code repository from which we can safely pull
contributed code written in Brazil, Serbia, Kyoto, and Thailand.
The sole cost is explicit encoding.  We could probably even cope
with EBCDIC, were someone lusting to use old character codes,
since we need only look at the first two lines.  If
we cannot find an encoding in the first two lines looking at
simple ASCII, we try as EBCDIC and look to see if we find it.
If not, we then try big5 and ....
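
Here is a sketch of that search, not a definitive implementation.
The comment pattern is the one given in PEP 263; the function name
find_declared_encoding and the list of bootstrap codecs (ASCII
first, then cp500 as one EBCDIC variant) are my own assumptions:

```python
import re

# Declaration pattern taken from PEP 263.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

# Hypothetical bootstrap codecs to try, in order; extend as needed.
BOOTSTRAP_CODECS = ("ascii", "cp500")


def find_declared_encoding(data):
    """Return the encoding named in the first two lines, or None."""
    for codec in BOOTSTRAP_CODECS:
        # Undecodable bytes become U+FFFD, so a wrong guess at the
        # bootstrap codec simply fails to match the pattern.
        head = data.decode(codec, errors="replace")
        for line in head.splitlines()[:2]:
            match = CODING_RE.match(line)
            if match:
                return match.group(1)
    return None


source = "#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n"
print(find_declared_encoding(source.encode("ascii")))
print(find_declared_encoding(source.encode("cp500")))
```

Only the declaration itself has to be readable in a bootstrap
codec; the rest of the file is then decoded with whatever it names.
Current Python ships tokenize.detect_encoding for the ASCII-compatible
case, so this sketch matters only for the EBCDIC-style fallback.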

Roman Suzi asked:
     "how one would feel if '# -*- coding: ascii -*-' would be
      necessary for every program?"
I replied,
     "I would probably never use it.  If I had to use an encoding,
      I would probably use: '# -*- coding: UTF-8 -*-', since I could
      encode other authors' names in comments (or credit strings)."

I really have no idea whether I am mentioning issues here that
people don't realize, or simply spouting off my opinions to a
group that finds them unconvincing.  I, of course, hope to be doing
the former and will resume my silence for fear that I am doing the
latter.

-Scott David Daniels