[I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model

Wed, 07 Feb 2001 10:35:53 +0000

On Tue, 06 Feb 2001 10:27:10 -0800, Paul Prescod
<paulp@ActiveState.com> wrote:

>"M.-A. Lemburg" wrote:
>>
>> ...
>>
>> Unicode is the de facto international standard for unified
>> script encodings. Discussing whether Unicode is good or bad is
>> really beyond the scope of language design and should be dealt
>> with in other more suitable forums, IMHO.
>
>We are in violent agreement.
>
>>...
>>
>> I don't understand your statement about allowing string objects
>> to support "higher" ordinals... are you proposing to add a third
>> character type ?
>
>Yes and no. I want to make a type with a superset of the functionality
>of strings and Unicode strings.
>
>> > Similarly, we could improve socket objects so that they have different
>> > readtext/readbinary and writetext/writebinary without unifying the
>> > string objects. There are lots of small changes we can make without
>> > breaking anything.
>
>Before we go on: do you agree that we could add fopen and
>readtext/readbinary on various I/O types without breaking anything? And
>that we should do so?

I dislike the idea of burdening the file object interface with
separate functions for binary and text IO, and a way of changing the
encoding. There are many other types/classes that support the file
interface, and I think it is desirable to support text IO on all of
them.

The wrapper approach from the codecs module seems better, since it can
be used to convert any byte file into a text file.
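
A minimal sketch of what I mean by the wrapper approach, written for a
current Python 3 interpreter. It assumes codecs.getreader() and
codecs.getwriter() are available, and uses io.BytesIO as a stand-in for
any byte-oriented file-like object; the sample data is made up for
illustration.

```python
import codecs
import io

# Any object with a byte-oriented read() will do; BytesIO stands in for
# a socket, pipe, or other file-like object.
raw = io.BytesIO(b"caf\xe9\n")

# codecs.getreader() returns a StreamReader factory for the encoding.
# Calling it wraps the byte stream; the underlying object is unchanged.
reader = codecs.getreader("latin-1")(raw)
text = reader.read()

# The writing direction works the same way with a StreamWriter.
sink = io.BytesIO()
writer = codecs.getwriter("utf-8")(sink)
writer.write(text)
encoded = sink.getvalue()
```

The byte-level object never needs to know about encodings, so every
class that supports the file interface gets text IO without growing
new methods.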

Also consider a hypothetical new storage device that stores Unicode
natively: how should it implement readbytes?

(However, an implicit 'from codecs import open as fopen' may make sense.)

>> > One I would like to see right now is a unification of
>> > chr() and unichr().
>>
>> This won't work: programs simply do not expect to get Unicode
>> characters out of chr() and would break.
>
>Why would a program pass a large integer to chr() if it cannot handle
>the resulting wide string????
>
>> OTOH, programs using
>> unichr() don't expect 8bit-strings as output.

We can unify these two only if we change the default encoding from
ASCII to latin1; otherwise mixing their results fails:

Python 2.0 (#6, Oct  6 2000, 15:49:48) [MSC 32 bit (Intel)] on win32
Type "copyright", "credits" or "license" for more information.
>>>
>>> u'\310'+unichr(200)
u'\310\310'
>>> u'\310'+chr(200)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)
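
The same effect can be sketched with explicit decoding, here in Python 3
syntax rather than the 2.0 session above. This is only an illustration:
byte_string plays the role of chr(200) and wide the role of unichr(200),
and the explicit decode() calls stand in for the implicit coercion.

```python
byte_string = bytes([200])   # plays the role of chr(200) in the session
wide = chr(200)              # plays the role of unichr(200)

# With an ASCII default, the coercion has nothing to map byte 200 to.
try:
    byte_string.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# With latin1, every byte value maps to the code point of the same
# ordinal, so the mixed concatenation matches the all-Unicode one.
latin1_result = "\u00c8" + byte_string.decode("latin-1")
```

Here ascii_ok ends up false, while latin1_result equals the result of
concatenating the two wide characters directly.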

The counter-argument from last time around was that this will do the
wrong thing for anyone mixing unicode objects with plain strings
containing non-latin1 content. This argument goes away once there is
only one type used for storing text.
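
For reference, a small sketch of that counter-argument in Python 3
syntax. KOI8-R and the sample character are my own choice of example,
not something from the earlier discussion.

```python
cyrillic = "\u0416"                      # CYRILLIC CAPITAL LETTER ZHE
koi8_bytes = cyrillic.encode("koi8-r")   # plain string, non-latin1 content

# A latin1 default never raises, so the mistake goes unnoticed:
# the byte is silently mapped to an unrelated Latin character.
misread = koi8_bytes.decode("latin-1")

# Only the encoding the bytes were actually written in recovers the text.
correct = koi8_bytes.decode("koi8-r")
```

The latin1 decode succeeds but produces the wrong character, which is
exactly the silent failure people were worried about.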

Toby Dickenson
tdickenson@geminidataloggers.com