[I18n-sig] Re: Unicode debate

Fredrik Lundh <effbot@telia.com>
Sat, 29 Apr 2000 16:52:30 +0200


Paul Prescod wrote:
> > > I think that maybe an important point is getting lost here. I could be
> > > wrong, but it seems that all of this emphasis on encodings is misplaced.
> >
> > In practical applications that manipulate text, encodings creep up all
> > the time.
>
> I'm not saying that encodings are unimportant. I'm saying that they
> are *different* than what Fredrik was talking about. He was talking
> about a coherent logical model for characters and character strings
> based on the conventions of more modern languages and systems than
> C and Python.

note that the existing Python language reference describes this
model very clearly:

    [Sequences] represent finite ordered sets indexed
    by natural numbers.

    The built-in function len() returns the number of
    items of a sequence.

    When the length of a sequence is n, the index set
    contains the numbers 0, 1, ..., n-1.

    Item i of sequence a is selected by a[i].

    An object of an immutable sequence type cannot
    change once it is created.

    The items of a string are characters.

    There is no separate character type; a character is
    represented by a string of one item.

    Characters represent (at least) 8-bit bytes.

    The built-in functions chr() and ord() convert between
    characters and nonnegative integers representing the
    byte values.

    Bytes with the values 0-127 usually represent the
    corresponding ASCII values, but the interpretation of values is
    up to the program.

    The string data type is also used to represent arrays
    of bytes, e.g., to hold data read from a file.
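
a minimal sketch of the sequence rules quoted above. it assumes a
present-day Python 3 interpreter (not the 1.x interpreter under
discussion), and the variable names are only illustrative:

```python
s = "spam"

# len() returns the number of items; the index set is 0, 1, ..., n-1
n = len(s)
indexes = list(range(n))

# item i is selected by s[i]; there is no separate character type,
# so every item is itself a string of one item
items = [s[i] for i in indexes]
item_lengths = [len(item) for item in items]

# an immutable sequence cannot change once created:
# item assignment on a string raises TypeError
try:
    s[0] = "S"
    mutated = True
except TypeError:
    mutated = False
```

nothing here says anything about how the characters are stored.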

as I've pointed out before, I want this to apply to all kinds of
strings in 1.6.  imo, the cleanest way to do this is to change
the last three sentences to:

    The built-in functions chr() and ord() convert between
    characters and nonnegative integers representing the
    character codes.

    Character codes usually represent the corresponding
    unicode characters.

    The 8-bit string data type is also used to represent arrays
    of bytes, e.g., to hold data read from a file.

the encodings debate has nothing to do with this model.
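
a rough illustration of that separation, assuming present-day
Python 3, whose str type happens to behave much like the model
proposed above. this is not the 1.6 proposal itself: Python 3 keeps
byte arrays in a separate bytes type rather than in an 8-bit string
type, and the sample text is made up:

```python
text = "caf\u00e9"  # four characters; the last one is U+00E9

# chr() and ord() convert between characters and character codes
codes = [ord(ch) for ch in text]
roundtrip = "".join(chr(code) for code in codes)

# encodings only come into play when the string is turned into bytes
utf8 = text.encode("utf-8")      # U+00E9 takes two bytes here
latin1 = text.encode("latin-1")  # and one byte here

# the string's length counts characters and does not depend on
# which encoding is chosen afterwards
lengths = (len(text), len(utf8), len(latin1))
```

len(text) stays at 4 while the two byte arrays differ in length,
so indexing and len() on the string never need to know about the
encoding.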

...

more later.  gotta run.

</F>