[Python-Dev] Unicode debate

Paul Prescod paul@prescod.net
Thu, 27 Apr 2000 21:20:22 -0500


Guido van Rossum wrote:
> 
> ...
>
> I've heard a few people claim that strings should always be considered
> to contain "characters" and that there should be one character per
> string element.  I've also heard a clamoring that there should only be
> one string type.  You folks have never used Asian encodings.  In
> countries like Japan, China and Korea, encodings are a fact of life,
> and the most popular encodings are ASCII supersets that use a variable
> number of bytes per character, just like UTF-8.  Each country or
> language uses different encodings, even though their characters look
> mostly the same to western eyes.  UTF-8 and Unicode is having a hard
> time getting adopted in these countries because most software that
> people use deals only with the local encodings.  (Sounds familiar?)

I think that maybe an important point is getting lost here. I could be
wrong, but it seems that all of this emphasis on encodings is misplaced.

The physical encoding and the logical makeup of character strings are
entirely separate issues. Unicode is a character set; it works in the
logical domain.
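
To make that concrete: the same logical string has one length, counted
in characters, however many bytes a particular physical representation
happens to need. A minimal sketch, assuming a string type whose
elements are characters and whose encode() method hands back raw bytes
("utf-8" and "latin-1" are the usual codec names):

    s = "K\u00e4se"               # four logical characters
    len(s)                        # -> 4, counted in the logical domain
    len(s.encode("utf-8"))        # -> 5 bytes in one physical encoding
    len(s.encode("latin-1"))      # -> 4 bytes in another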

Dozens of different physical encodings can be used for Unicode
characters. There are XML users who work with XML (and thus Unicode)
every day and never see UTF-8, UTF-16 or any other Unicode-consortium
"sponsored" encoding. If you invent an encoding tomorrow, it can still
be XML-compatible. There are many encodings older than Unicode that are
XML (and Unicode) compatible.
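
As a sketch of what I mean: the same characters can be pushed through
any codec that covers them, and only the bytes change, never the
logical string. (This assumes codecs registered under these names;
all four are standard codec names in today's Python.)

    s = "\u65e5\u672c\u8a9e"          # "Japanese", three characters
    for codec in ("utf-8", "utf-16-le", "shift_jis", "euc_jp"):
        # Same logical characters, four different byte sequences.
        print(codec, s.encode(codec))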

I have not heard complaints about the XML way of looking at the world,
and in fact it was explicitly endorsed by many of the world's leading
experts on internationalization. I haven't followed the Java situation
as closely, but I have also not heard screams about its support for
i18n.

> The truth of the matter is: the encoding of string objects is in the
> mind of the programmer.  When I read a GIF file into a string object,
> the encoding is "binary goop".  

IMHO, it's a mistake of history that you would even think it makes
sense to read a GIF file into a "string" object, and we should be
trying to erase that mistake as quickly as possible (which is
admittedly not very quickly), not building more and more
infrastructure around it. How can we make the transition to a "binary
goops are not strings" world easiest?
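
Here is a sketch of how that world could look, assuming distinct types
for byte sequences and character strings, and an open() that knows
which of the two you are asking for (the file names are made up):

    # Binary goop stays binary: no characters, no encoding in sight.
    with open("picture.gif", "rb") as f:
        goop = f.read()            # a byte sequence

    # Text is text: characters exist only after an explicit decode.
    with open("notes.txt", encoding="utf-8") as f:
        text = f.read()            # a character string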

> The moral of all this?  8-bit strings are not going away.  

If that is a statement of your long-term vision, then I think it is
very unfortunate. Treating string literals as if they were isomorphic
with byte arrays was probably the right thing in 1991, but it won't be
in 2005.

It doesn't meet the definition of string used in the Unicode spec, nor
in XML, nor in Java, nor at the W3C, nor in most other up-and-coming
specifications.
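
In every one of those definitions a string element is a character, not
a byte, and the two are not interchangeable. A small illustration,
under the same assumptions as the sketches above:

    s = "na\u00efve"              # five characters
    s[2]                          # -> the single character U+00EF
    len(s)                        # -> 5
    len(s.encode("utf-8"))        # -> 6: the byte array is not
                                  #    isomorphic to the character string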

From the W3C site:

""While ISO-2022-JP is not sufficient for every ISO10646 document, it is
the case that ISO10646 is a sufficient document character set for any
entity encoded with ISO-2022-JP.""

http://www.w3.org/MarkUp/html-spec/charset-harmful.html
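
That is the logical/physical split again: ISO-2022-JP is one physical
encoding, and every character it can represent lives inside the
ISO10646/Unicode character set. A round-trip sketch, assuming an
"iso2022_jp" codec is available under that name:

    s = "\u65e5\u672c\u8a9e"             # three Japanese characters
    raw = s.encode("iso2022_jp")          # a legacy, pre-Unicode wire format
    assert raw.decode("iso2022_jp") == s  # the character set covers the encoding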

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html