[Python-Dev] Python Unicode

Paul Prescod paul@prescod.net
Wed, 26 Apr 2000 20:47:41 -0500


Fredrik Lundh wrote:
> 
> ...
>
> But alright, I give up.  I've wasted way too much time on this, my
> patches were rejected, and nobody seems to care.  Not exactly
> inspiring.

I can understand how frustrating this is. Sometimes something seems just
so clean and mathematically obvious that you can't see why others don't
see it that way.

A character is the "smallest unit of text."
Strings are lists of characters.
Characters in character sets have numbers.
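
As a minimal sketch of that model, assuming Python 3 semantics (which
postdate this message), a string can be taken apart into characters, and
each character mapped to its number and back with the built-ins ord()
and chr():

```python
# Assumption: Python 3, where there is a single text string type.
text = "caf\u00e9"  # "cafe" with an acute accent on the e, written as an escape

chars = list(text)                           # a string is a sequence of characters
numbers = [ord(c) for c in chars]            # every character has a number
rebuilt = "".join(chr(n) for n in numbers)   # and the numbers map back to characters

print(chars)
print(numbers)
print(rebuilt == text)
```

Nothing in the sketch depends on how the string is stored internally,
which is the point of the three statements above.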

Python users should never know or care whether a string object is an
8-bit string or a Unicode string. There should be no distinction. u""
should be a syntactic shortcut. The primary reason I have not been
involved is that I have not had a chance to look at the implementation
and figure out if there is an overriding implementation-based reason to
ignore the obvious right thing (e.g. the right thing will break too much
code or be too slow or...). "Unicode objects" should be an
implementation detail (if they exist at all).

Strings are strings are strings. The Python programmer shouldn't care
about whether one string was read from a Unicode file and another from
an ASCII file and one typed in with "u" and one without. It's all the
same thing! 

If the programmer wants to do an explicit UTF-8 decode on a string
(whether it is Unicode or 8-bit string...no difference) then that decode
should proceed by looking at each character, deriving an integer and
then treating that integer as an octet according to the UTF-8
specification.

Char -> Integer -> Byte -> Char
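
A rough sketch of that pipeline, again assuming Python 3 rather than the
interpreter under discussion. The helper name decode_utf8_from_chars is
hypothetical, as is the choice to raise ValueError for characters whose
number does not fit in one octet:

```python
def decode_utf8_from_chars(s):
    """Decode a string whose characters each stand for one UTF-8 octet."""
    octets = []
    for ch in s:
        n = ord(ch)  # Char -> Integer
        if n > 255:
            # Assumed behaviour: a number above 255 cannot be one octet.
            raise ValueError("character %r does not fit in one octet" % ch)
        octets.append(n)  # Integer -> Byte
    # Byte -> Char: interpret the octet sequence per the UTF-8 specification.
    return bytes(octets).decode("utf-8")


# The two UTF-8 octets of e-acute (0xC3 0xA9), each held as one character.
raw = "caf\u00c3\u00a9"
decoded = decode_utf8_from_chars(raw)
print(decoded == "caf\u00e9")
```

The same function works whether the input came from a file, a literal
with "u" or a literal without it, since it only ever asks each character
for its number.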

The end result (and hopefully the performance) would be the same but the
model is much, much cleaner if there is only one kind of string. We
should not ignore the example set by every other language (and yes, I'm
including XML here :) ). I'm as desperate (if not as vocal) as Fredrik
is here.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html