[Python-Dev] #pragmas in Python source code

Fredrik Lundh <effbot@telia.com>
Sat, 15 Apr 2000 12:14:54 +0200


> This is exactly the same as proposing to change the default
> encoding to Latin-1.

no, it isn't.  here's what I'm proposing:

-- the internal character set is unicode, and nothing but
   unicode.  in 1.6, this applies to strings.  in 1.7 or later,
   it applies to source code as well.

-- the default source encoding is "unknown"

-- there is no other default encoding.  all strings use the
   unicode character set.

to give you some background, let's look at section 3.2 of
the existing language definition:

    [Sequences] represent finite ordered sets indexed
    by natural numbers.

    The built-in function len() returns the number of
    items of a sequence.

    When the length of a sequence is n, the index set
    contains the numbers 0, 1, ..., n-1.

    Item i of sequence a is selected by a[i].

    An object of an immutable sequence type cannot
    change once it is created.

    The items of a string are characters.

    There is no separate character type; a character is
    represented by a string of one item.

    Characters represent (at least) 8-bit bytes.

    The built-in functions chr() and ord() convert between
    characters and nonnegative integers representing the
    byte values.

    Bytes with the values 0-127 usually represent the
    corresponding ASCII values, but the interpretation of
    values is up to the program.

    The string data type is also used to represent arrays
    of bytes, e.g., to hold data read from a file.

(in other words, given a string s, len(s) is the number of characters
in the string, s[i] is the i'th character, and len(s[i]) is 1.  the
existing string type also doubles as a byte array type: given an array
b, len(b) is the number of bytes, b[i] is the i'th byte, and so on.)
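
a quick interpreter session to make that concrete (this just
illustrates the current 8-bit string type, nothing new):

>>> s = "hello"
>>> len(s)        # number of items (characters)
5
>>> s[1]          # item i is itself a string of one item
'e'
>>> len(s[1])
1
>>> ord(s[1])     # today, this is a byte value
101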

my proposal boils down to a few small changes to the last three
sentences in the definition.  basically, change "byte value" to
"character code" and "ascii" to "unicode":

    The built-in functions chr() and ord() convert between
    characters and nonnegative integers representing the
    character codes.

    Character codes usually represent the corresponding
    unicode values.

    The 8-bit string data type is also used to represent arrays
    of bytes, e.g., to hold data read from a file.

that's all.  the rest follows from this.
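
to make that concrete, here's a rough sketch of an interpreter
session under the proposed model.  I'm using the unichr() builtin
from the 1.6 unicode support as a stand-in for a unified chr(),
so read this as an illustration, not a spec:

>>> ord(u"\u20ac")      # the character code (euro sign), not a byte value
8364
>>> c = unichr(8364)    # a character is still a string of one item
>>> len(c)
1
>>> ord(c)
8364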

...

just a few quickies to sort out common misconceptions:

> I don't have anything against that (being a native Latin-1
> user :), but I would assume that other native language
> writers sure do: e.g. all programmers not using Latin-1
> as native encoding (and there are lots of them).

the unicode folks have already made that decision.  I find it
very strange that we should use *another* model for the
first 256 characters, just to "equally annoy everyone".

(if people have a problem with the first 256 unicode characters
having the same internal representation as the ISO 8859-1 set,
tell them to complain to the unicode folks).
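
to see what that coincidence buys us, here's a small sketch using
the unicode() builtin from 1.6.  note that latin-1 is named
explicitly; no default encoding is involved:

>>> ord(u"\u00e9")                 # unicode code point for e-acute
233
>>> ord("\351")                    # the latin-1 byte has the same value
233
>>> unicode("\351", "latin-1") == u"\u00e9"
1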

> (and this is not far fetched since there are input sources
> which do return UTF-8, e.g. TCL), the Unicode implementation
> will apply all its knowledge in order to get you satisfied.

there are all sorts of input sources.  major platforms like
windows and java use 16-bit unicode.

and Tcl has an internal unicode string type, since they
realized that storing UTF-8 in 8-bit strings was horridly
inefficient (they tried to do it right, of course).  the
internal type looks like this:

typedef unsigned short Tcl_UniChar;

typedef struct String {
    int numChars;             /* number of characters in the unicode rep */
    size_t allocated;         /* bytes allocated for the utf-8 string rep */
    size_t uallocated;        /* bytes allocated for the unicode array */
    Tcl_UniChar unicode[2];   /* the characters; over-allocated as needed */
} String;

(Tcl uses dual-ported objects, where each object can
have a UTF-8 string representation in addition to the
internal representation.  if you change one of them, the
other is recalculated on demand)

in fact, it's Tkinter that converts the return value to
UTF-8, not Tcl.  that can be fixed.

> > Python doesn't convert between other data types for me, so
> > why should strings be a special case?
>
> Sure it does: 1.5 + 2 == 3.5, 2L + 3 == 5L, etc...

but that's the key point: 2L and 3 are both integers, from the
same set of integers.  if you convert a long integer to an integer,
it still contains an integer from the same set.

(maybe someone can fill me in: what's the formally correct
word here?  set?  domain?  category?  universe?)

also, if you convert every item in a sequence of long integers to
ordinary integers, all items are still members of the same integer
set.

in contrast, the UTF-8 design converts between strings of
characters, and arrays of bytes.

unless you change the 8-bit string type to know about UTF-8,
that means that you change string items from one domain
(characters) to another (bytes).
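
here's a sketch of what that difference looks like, using the
encode() method of the 1.6 unicode type; the point is that the
items change domain, not that the code itself is anything new:

>>> int(2L) + 3            # long -> int: still an integer
5
>>> s = u"\u20ac"          # one character...
>>> len(s)
1
>>> b = s.encode("utf-8")  # ...becomes three bytes under utf-8
>>> len(b)
3
>>> ord(b[0])              # the items are now byte values, not characters
226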

> Note that you are again arguing for using Latin-1 as
> default encoding -- why don't you simply make this fact
> explicit?

nope.  I'm standardizing on a character set, not an encoding.

character sets are mappings between integers and characters.
in this case, we use the unicode character set.

encodings are ways to store strings of text as bytes in a byte
array.
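
a quick sketch of the distinction, assuming the codecs that ship
with 1.6: the character set stays the same, while the byte layout
varies with the encoding.

>>> s = u"\u00e5\u20ac"          # two characters from the unicode character set
>>> len(s)
2
>>> len(s.encode("utf-8"))       # the same text stored as bytes, two ways
5
>>> len(s.encode("utf-16-be"))
4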

> not now, when everything has already been implemented and
> people are starting to use the code that's there with great
> success.

the positive reports I've seen all rave about the codec framework.
that's a great piece of work.  without that, it would have
been impossible to do what I'm proposing.  (so what are you
complaining about?  it's all your fault -- if you hadn't done such
a great job on that part of the code, I wouldn't have noticed
the warts ;-)

if you look at my proposal from a little distance, you'll realize
that it doesn't really change much.  all that needs to be done
is to change some of the conversion stuff.  if we decide to
do this, I can do the work for you, free of charge.

</F>