[I18n-sig] Re: Unicode debate

Guido van Rossum guido@python.org
Mon, 01 May 2000 14:14:48 -0400


Fredrik Lundh wrote:

> note that the existing Python language reference describes this
> model very clearly:
> 
>     [Sequences] represent finite ordered sets indexed
>     by natural numbers.
> 
>     The built-in function len() returns the number of
>     items of a sequence.
> 
>     When the length of a sequence is n, the index set
>     contains the numbers 0, 1, ..., n-1.
> 
>     Item i of sequence a is selected by a[i].
> 
>     An object of an immutable sequence type cannot
>     change once it is created.
> 
>     The items of a string are characters.
> 
>     There is no separate character type; a character is
>     represented by a string of one item.
> 
>     Characters represent (at least) 8-bit bytes.
> 
>     The built-in functions chr() and ord() convert between
>     characters and nonnegative integers representing the
>     byte values.
> 
>     Bytes with the values 0-127 usually represent the
>     corresponding ASCII values, but the interpretation of values is
>     up to the program.
> 
>     The string data type is also used to represent arrays
>     of bytes, e.g., to hold data read from a file. 
> 
> as I've pointed out before, I want this to apply to all kinds of
> strings in 1.6.  imo, the cleanest way to do this is to change
> the last three sentences to:
> 
>     The built-in functions chr() and ord() convert between
>     characters and nonnegative integers representing the
>     character codes.
> 
>     Character codes usually represent the corresponding
>     unicode characters.
> 
>     The 8-bit string data type is also used to represent arrays
>     of bytes, e.g., to hold data read from a file.

Again, you're being terse.  I'm not sure what you want to do here.  Do
you want chr() to return a Unicode string for argument values >= 256?
(Note that ord(u"\xffff") already returns 65535; I just noticed that
ord(u"\777") returns 255 instead of 511, which I consider a bug.)
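
To make the chr()/ord() question concrete, here is a minimal sketch of
the round trip being asked about.  It assumes a modern Python 3
interpreter (not the 1.6 under discussion), where chr() accepts
arguments of 256 and above and returns a string of one item.  The
helper name roundtrip is made up for illustration.

```python
# Assumption: Python 3, where str items are Unicode code points.
def roundtrip(code):
    """Return (character, code recovered by ord) for a nonnegative integer."""
    ch = chr(code)
    return ch, ord(ch)

for code in (65, 255, 256, 511, 65535):
    ch, back = roundtrip(code)
    # Each character is a string of exactly one item, and ord() inverts chr().
    print(code, len(ch), back == code)
```

Under that assumption every code, including 511, comes back unchanged,
which is the behavior the ord(u"\777") case above falls short of.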

You have to understand that the reference documentation is sloppy with
the word "character" -- when I wrote that text, "character" and "byte"
were synonyms in my mind.
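
The gap between the two words can be shown with a short sketch, again
assuming a modern Python 3 interpreter, where text strings and byte
strings are separate types.  The variable names are illustrative only.

```python
# Assumption: Python 3, where str holds characters and bytes holds 8-bit values.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one character

as_latin1 = text.encode("latin-1")  # one byte: 0xE9
as_utf8 = text.encode("utf-8")      # two bytes: 0xC3 0xA9

# One character, but the byte count depends on the encoding chosen.
print(len(text), len(as_latin1), len(as_utf8))  # 1 1 2
```

With an 8-bit encoding such as Latin-1 the character and the byte
coincide, which is why the two words could be used interchangeably;
with a multi-byte encoding they no longer do.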

> the encodings debate has nothing to do with this model.

If this has nothing to do with the encodings debate, why is it in the
same thread?

Please elaborate.  (But please finish the next sre snapshot first! :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)