[I18n-sig] Pre-PEP: Proposed Python Character Model

Paul Prescod paulp@ActiveState.com
Tue, 06 Feb 2001 17:12:43 -0800


"Martin v. Loewis" wrote:
> 
> ...
> I disagree. There should be a character string type and a byte string
> type, at least. I would agree that a single character string type is
> desirable.

It depends on whether we decide to talk about "byte strings" or "byte
arrays".

> >     type("") == type(chr(150)) == type(chr(1500)) == type(file.read())
> 
> I disagree. For the last one, much depends on what file is. If it is a
> byte-oriented file, reading from it should not return character
> strings.

I don't think that there should be such a thing as a byte-oriented
file...but that's a pretty small detail.

I think that the result of the read() function should consistently be a
character string, not something that differs from one type of file object
to another; getting a byte array/string/thing should be a separate method.

> >     2. It should be easier and more efficient to encode and decode
> >        information being sent to and retrieved from devices.
> 
> I disagree. Easier, maybe; more efficient - I don't think Python is
> particular inefficient in encoding/decoding.

Once I have a file object, I don't know of a way to read Unicode from it
without reading bytes and then decoding them into another string...but I
may just not know that there is a more efficient way.
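Concretely, the only idioms I know of look roughly like this (the file
name is invented for illustration):

    import codecs

    # the two-step byte-then-decode dance
    raw = open("data.txt", "rb").read()        # byte string
    text = unicode(raw, "utf-8")               # decode into a Unicode string

    # or going through the codecs module so read() decodes for you
    f = codecs.open("data.txt", "r", "utf-8")
    text = f.read()                            # already a Unicode string

The codecs route hides the second step but, as far as I know, still does
the byte-level read and decode underneath.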

> Sure it is possible. Different character sets (in your terminology)
> have common characters, which is a phenomenon that your definition
> cannot describe. Mathematically speaking, there is an unlimited domain
> CHAR (the set of all characters), 

CHAR is not a useful set in a computer science sense, because if items
from it are addressable or comparable then there exists an ord()
function, and therefore there is a character set. If the items are not
addressable or comparable, then how would you make use of it?

We could argue about the Platonic truth embedded in the word "character",
but I think that's a waste of time.

> More generally, it is a mechanism for representing character sequences
> in terms of bit sequences. Otherwise, you can not cover the phenomenon
> that the encoding of a string is not the concatenation of the
> encodings of the individual characters in some encodings.
> 
> Also, this term is often called "coded character set" (CCS).

Fair enough.

> >         Similarly a Python programmer does not need to know or care
> >         how characters are represented in memory. We might even
> >         change the representation over time to achieve higher
> >         performance.
> 
> Programmers need to know the character set, at a minimum. Since you
> were assuming that you can't have characters without character sets, I
> guess you've assumed that as implied.

The whole point of these two sections is that programmers should care a
lot about the character set and not at all about its in-memory
representation.

> >     Universal Character Set
> >
> >         There is only one standardized international character set that
> >         allows for mixed-language information.
> 
> Not true. E.g. ISO 8859-5 allows both Russian and English text,
> ISO 8859-2 allows English, Polish, German, Slovakian, and a few
> others. 

If you want to use a definition of "international" that means "European"
then I guess that's fair. But you don't say you've internationalized a
computer program when you've added support for the Canadian dollar along
with the American one. :)

> ISO 2022 (and by reference all incorporated character sets)
> supports virtually all existing languages.

I do not believe that ISO 2022 is really considered a character set.

> >         A popular subset of the Universal Character Set is called
> >         Unicode. The most popular subset of Unicode is called the "Unicode
> >         Basic Multilingual Plane (Unicode BMP)".
> 
> Isn't the BMP the same as Unicode, as it is the BMP (i.e. group 0,
> plane 0) of ISO 10646?

No, Unicode has space for 16 more planes beyond the BMP:

UTF-16 extra planes (to be filled by Unicode 4 and ISO-10646-2) 
	Non-Han Supplementary Plane 1: {U-00010000..U-0001FFFF} 
	Etruscan: {U-00010200..U-00010227} 
	Gothic: {U-00010230..U-0001024B} 
	Klingon: {U-000123D0..U-000123F9} 
	Western Musical Symbols: {U-0001D103..U-0001D1D7} 
	Han Supplementary Plane 2: {U-00020000..U-0002FFFF} 
	Reserved Planes 3..13: {U-00030000..U-000DFFFF} 
	Plane 14: {U-000E0000..U-000EFFFF} 
	Language Tag Characters: {U-000E0000..U-000E007F}
	Private Use Planes: {U-000F0000..U-0010FFFF} 

> >             Java
> >         It is the author's belief this "running code" is evidence of
> >         Unicode's practical applicability.
> 
> At least in the case of Java, I disagree. It very much depends on the
> exact version of the JVM that you are using, but I had the following
> problems:

I'm not saying that any particular Unicode-using system is perfect. I'm
saying that they work. I don't think that Java would work better if it
used something other than Unicode.

> Sure. Code that treats character strings as if they are byte strings
> will break.

We've discussed this further and I think I may yet convince you
otherwise...

> >     This means that Unicode literals and escape codes can also be
> >     merged with ordinary literals and escape codes. unichr can be merged
> >     with chr.
> 
> Not sure. That means that there won't be byte string literals. It is
> particular worrying that you want to remove the way to get the numeric
> value of a byte in a byte string.

I don't recall suggesting any such thing! ord() of a byte in a byte
string should return the byte value; ord() of a character in a Unicode
string should return the character value.
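A tiny illustration of what I mean (both of these already work today):

    assert ord("\xf5") == 245       # element of a byte string
    assert ord(u"\xf5") == 245      # element of a Unicode string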

> Are you saying that byte strings are visible to the average programmer
> in rare circumstances only? Then I disagree; byte strings are
> extremely common, as they are what file.read returns.

Not under my proposal. file.read returns a character string. Sometimes
the character string contains characters between 0 and 255 and is
indistinguishable from today's string type. Sometimes the file object
knows that you want the data decoded and it returns large characters.
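A purely hypothetical sketch of what I have in mind -- fopen is the
Pre-PEP's proposed encoding-aware built-in, not something that exists
today:

    f = open("dump.bin")                 # no encoding specified
    data = f.read()                      # character string whose code points
                                         # all fall in 0..255

    g = fopen("page.txt", "r", "UTF-8")  # hypothetical encoding-aware open
    text = g.read()                      # may contain characters above 255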

> >     Unfortunately, there is not one, single, dominant encoding. There are
> >     at least a dozen popular ones including ASCII (which supports only
> >     0-127), ISO Latin 1 (which supports only 0-255), others in the ISO
> >     "extended ASCII" family (which support different European scripts),
> >     UTF-8 (used heavily in C programs and on Unix), UTF-16 (preferred by
> >     Java and Windows), Shift-JIS (preferred in Japan) and so forth. This
> >     means that the only safe way to read data from a file into Python
> >     strings is to specify the encoding explicitly.
> 
> Note how you are mixing character sets and encodings here. As you had
> defined earlier, a single character set (such as US-ASCII) can have
> multiply encodings (e.g. with checksum bit or without).

I believe that ASCII is both a character set and an encoding. If not,
what is the name for the encoding we've been using prior to Unicode?

> >     Any code that does I/O should be changed to require the user to
> >     specify the encoding that the I/O should use. It is the opinion of
> >     the author that there should be no default encoding at all.
> 
> Not sure. IMO, the default should be to read and write byte strings.

The default for current Python code, yes. The default going forward? We
could debate that.

> Sounds good. Note that the proper way to write this is

We need a built-in function that everyone uses as an alternative to the
byte/string-ambiguous "open".

>    fileobj = codecs.open("foo", "r", "ASCII")
>    # etc
> 
> >         fileobj2.encoding = "UTF-16" # changed my mind!
> 
> Why is that a requirement. In a normal stream, you cannot change the
> encoding in the middle - in particular not from Latin 1 single-byte to
> UTF-16.

What is a "normal stream?" Python must be able to handle all streams,
right? I can imagine all kinds of pickle-like or structured stream file
formats that switch back and forth between binary information, strings
and unicode. I'd rather not require our users to handle these in
multiple passes.

BTW, you only know the encoding of an XML file after you've read the
first line...
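Something like this (a rough sketch; real XML encoding detection also has
to cope with byte order marks, UTF-16, and so on):

    import re

    # peek at the declaration, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
    first = open("doc.xml", "rb").readline()
    m = re.search('encoding="([^"]+)"', first)
    declared = m and m.group(1) or "UTF-8"
    text = unicode(open("doc.xml", "rb").read(), declared)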

> Disagree. If a file is open for reading characters, reading bytes from
> the middle is not possible. If made possible, it won't be more efficient,
> as you have to keep track of the encoder's state. Instead, the right way
> to write this is
> 
>      fileobj2 = open("bar", "rb")
>      moredata = fileobj2.read(1024)

I disagree on many levels...but I'm willing to put off this argument.

> ...
> >     #?encoding="UTF-8"
> >     #?encoding="ISO-8859-1"
> 
> The specific syntax may be debatable; I dislike semantics being put in
> comments. There should be first-class syntax for that. Agree on the
> principle approach.

We need a backwards-compatible syntax...
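For instance, something along these lines would be harmless in today's
interpreter (which just sees a comment) while a new interpreter could act
on it; the effect described in the comment is only the proposal, not
current behaviour:

    #?encoding="ISO-8859-1"
    # Under the proposal, the Latin-1 bytes in this literal would be
    # decoded into the corresponding characters.
    name = "Jos\xe9 P\xe9rez"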

> >     Python already has a rule that allows the automatic conversion
> >     of characters up to 255 into their C equivalents.
> 
> If it is a character (i.e. Unicode) string, it only converts 127
> characters in that way.

Yes, this is an annoying difference. But I was talking about *Python
strings*, not Unicode strings.
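Concretely, here is the asymmetry as I understand Python 2.0's behaviour
(file.write goes through the same conversion machinery):

    f = open("out.bin", "wb")
    f.write("\xf5")      # byte 245, written as-is
    f.write(u"\xf5")     # raises UnicodeError: the default ASCII codec
                         # only covers characters 0-127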

> >         Ordinary string literals should allow large character escape codes
> >         and generate Unicode string objects.
> 
> That is available today with the -U option. I'm -0 on disallowing byte
> string literals, as I don't consider them too important.

I don't know what you mean by disallowing byte string literals.

If I type:

a="abcdef"

Python is ambiguous about whether this is a character string literal or
a byte string literal. I'm planning on interpreting it as a character
string literal. That's just a definitional thing; it doesn't break
anything or remove anything. It doesn't even hurt if you use escapes to
embed nulls or other control characters: Unicode character equivalents
exist for all of them.
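For example (this works today, with u-literals available since 2.0):

    a = "abc\x00def"      # byte string literal with an embedded NUL
    b = u"abc\x00def"     # the equivalent character string literal
    assert len(a) == len(b) == 7
    assert ord(a[3]) == ord(b[3]) == 0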

> >         The format string "S" and the PyString_AsString functions should
> >         accept Unicode values and convert them to character arrays
> >         by converting each value to its equivalent byte-value. Values
> >         greater than 255 should generate an exception.
> 
> Disagree. Conversion should be automatic only up to 127; everything
> else gives questionable results.

This is a fundamental disagreement that we will have to work through.
What is "questionable" about interpreting Unicode character 245 as byte
245? If you wanted UTF-8, you would have asked for UTF-8!!!
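In other words ("latin-1" and "utf-8" here just name the two
interpretations):

    c = unichr(245)
    assert c.encode("latin-1") == "\xf5"       # one byte, value 245
    assert c.encode("utf-8") == "\xc3\xb5"     # two bytes -- not what was asked for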

> >         fopen should be like Python's current open function except that
> >         it should allow and require an encoding parameter.
> 
> Disagree. This is codec.open.

codecs.open will never become popular.

> >     Python needs to support international characters. The "ASCII" of
> >     internationalized characters is Unicode. Most other languages have
> >     moved or are moving their basic character and string types to
> >     support Unicode. Python should also.
> 
> And indeed, Python does today. I don't see a problem *at all* with the
> structure of the Unicode support in Python 2.0. As initial experiences
> show, application *will* need to be modified to take Unicode into
> account; I doubt that any enhancements will change that.

Let's say you are a Chinese Tcl programmer. If you know the escape code
for a Kanji character, you put it in a string literal just as a Westerner
would.

The same Chinese Python programmer must use a special string literal
syntax, the object he creates has a different type, and lots and lots of
trivial, otherwise language-agnostic code crashes because it tests for
type("") when it could handle large character codes without a problem.
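Here is the sort of innocent check I mean (the function is invented for
illustration):

    def shout(s):
        if type(s) == type(""):
            return s.upper()
        raise TypeError("expected a string")

    shout("hello")     # fine
    shout(u"hello")    # TypeError, even though upper() works on Unicode too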

I see this as a big problem...

 Paul Prescod