[XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate (Guido van Rossum)

Tony Graham tgraham@mulberrytech.com
Tue, 2 May 2000 14:18:13 -0400 (EST)


I subscribe to the Digest, so I'm a bit behind...

At 2 May 2000 11:56 -0400, xml-sig-request@python.org wrote:
 > From: Guido van Rossum <guido@python.org>
 > Date: Mon, 01 May 2000 17:32:38 -0400
 > Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
...
 > I have a bunch of good reasons (I think) for liking UTF-8: it allows
 > you to convert between Unicode and 8-bit strings without losses, Tcl

UTF-8 is a variable-length 8-bit encoding of Unicode characters.  The
only characters that convert cleanly between UTF-8 and fixed-length
8-bit strings are the ASCII characters.
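As a quick illustration of the point (shown in modern Python for clarity -- these APIs postdate this post), only ASCII text keeps the one-byte-per-character property under UTF-8:

```python
# Only ASCII occupies a single byte in UTF-8, so only ASCII survives
# a byte-per-character view of a string.
ascii_text = "hello"
accented = "café"

assert len(ascii_text.encode("utf-8")) == len(ascii_text)  # 5 bytes, 5 chars
assert len(accented.encode("utf-8")) == len(accented) + 1  # "é" takes 2 bytes
```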

 > uses it (so displaying Unicode in Tkinter *just* *works*...), it is
 > not Western-language-centric.

UTF-8 is Western-language-centric.  In fact, it's practically
English-centric since only the ASCII characters are 1 byte per
character, the characters for writing most European languages plus
Arabic and Hebrew are 2 bytes per character, and the rest -- including
Hangul and the CJK ideographs -- are 3 bytes per character.  Japanese
text files, for example, are 50% larger as UTF-8 text than as UTF-16
text.
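The 50% figure is easy to check (modern Python shown for illustration; the sample string is mine, not from any particular file):

```python
# Five Hiragana characters: 3 bytes each in UTF-8, 2 bytes each in
# UTF-16, hence the 50% size penalty for Japanese text in UTF-8.
text = "こんにちは"
utf8 = text.encode("utf-8")       # 15 bytes
utf16 = text.encode("utf-16-be")  # 10 bytes (big-endian, no BOM)

assert len(utf8) == 15
assert len(utf16) == 10
assert len(utf8) * 2 == len(utf16) * 3  # exactly 50% larger
```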

 > Another reason: while you may claim that your (and /F's, and Just's)
 > preferred solution doesn't enter into the encodings issue, I claim it
 > does: Latin-1 is just as much an encoding as any other one.
 > 
 > I claim that as long as we're using an encoding we might as well use
 > the most accepted 8-bit encoding of Unicode as the default encoding.

There have been other proposals for variable-length 8-bit
transformation formats of Unicode characters, but UTF-8 is the only
one that is specified in the Unicode Standard and ISO/IEC 10646.

There is less hassle with characters outside the 16-bit Basic
Multilingual Plane (BMP) with UTF-8 than with, for example, UTF-16.
When working with UTF-8, you have to consider that all characters are
encoded as varying numbers of bytes.  When working with UTF-16, it's
easy to assume that all characters are 16-bit and write your code
accordingly, but there will shortly be characters defined outside of
the BMP -- including math characters used in MathML and new but
essential CJK ideographs -- so you have to work with UTF-16 data as
being "16-bit except when it isn't".
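A sketch of that asymmetry, using a math alphanumeric character of the kind later defined for MathML (modern Python notation, for illustration only):

```python
# U+1D400 MATHEMATICAL BOLD CAPITAL A lies outside the BMP.  In
# UTF-16 it needs a surrogate pair -- two 16-bit code units -- so
# code that assumes "one character == one 16-bit unit" breaks.  In
# UTF-8 it is just another multi-byte sequence, handled by the same
# rules as every other character.
ch = "\U0001D400"

assert len(ch.encode("utf-16-be")) == 4  # surrogate pair: 2 x 16 bits
assert len(ch.encode("utf-8")) == 4      # four bytes, nothing special
```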

It shouldn't matter what encoding or transformation format is used for
the internal representation of strings.  Python should be able to read
and write files in a number of encodings so that it plays well with
others.  I compared eight languages in the "Programming Language
Support" chapter of "Unicode: A Primer" (ISBN: 0-7645-4625-2) and
found that there was no Unicode encoding that all eight languages
could read and write.  Playing well with others also means reading and
writing whatever non-Unicode encoding a user keeps his data in.

Python should also be able to read Python programs in a number of
encodings, including UTF-8 and UTF-16, plus it should include a
mechanism for referencing Unicode characters by number (or name)
within strings.
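For what it's worth, later versions of Python did grow exactly such a mechanism; the escapes below are shown purely to illustrate the idea, not as anything available at the time of writing:

```python
# Reference a character by code point number or by its Unicode name.
by_number = "\u00e9"                             # U+00E9
by_name = "\N{LATIN SMALL LETTER E WITH ACUTE}"  # same character, by name

assert by_number == by_name == "é"
```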

 > I also think that the issue is blown out of proportions: this ONLY
 > happens when you use Unicode objects, and it ONLY matters when some
 > other part of the program uses 8-bit string objects containing
 > non-ASCII characters.  Given the long tradition of using different
 > encodings in 8-bit strings, at that point it is anybody's guess what
 > encoding is used, and UTF-8 is a better guess than Latin-1.

Given the long tradition of using different encodings in 8-bit
strings, surely there's no safe assumption about the encoding in any
8-bit string?  ISO 8859-1 (Latin-1) is being superseded by ISO 8859-15
(which shuffled a few things and added the euro); Windows' CP 1252
isn't really ISO 8859-1 despite how some mailers and HTML editors
label it; and I've even processed multi-byte Japanese, Chinese, and
Korean text using 8-bit scripting languages.
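A one-byte demonstration of why guessing is unsafe (modern Python used for illustration):

```python
# The very same byte decodes to different characters under different
# 8-bit encodings, so no single default is a safe assumption.
raw = b"\x80"

assert raw.decode("cp1252") == "\u20ac"   # Windows CP 1252: the euro sign
assert raw.decode("latin-1") == "\x80"    # ISO 8859-1: a C1 control character
```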

Perl, for example, has "byte" and "utf8" pragmata for controlling
whether strings are treated as fixed-length 1-byte characters or as
variable-length UTF-8 characters, with the current default being
"byte".

Tcl, to use another example, can read and write files in a number of
encodings, but it defaults to using the system encoding, or ISO 8859-1
if it can't determine the system encoding.

Python, similarly, should not make assumptions about the encoding used
in strings in existing programs and should be flexible in supporting
the encodings that people do use.
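Something like the following is what I have in mind -- the per-file `encoding` parameter is hypothetical relative to the Python of today, though modern Python did eventually adopt this shape:

```python
# Flexibility sketch: the encoding is stated explicitly per file
# rather than assumed, so the same program can read and write
# whatever encoding the user's data is actually in.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")

with open(path, "w", encoding="utf-16") as f:
    f.write("déjà vu")

with open(path, encoding="utf-16") as f:
    assert f.read() == "déjà vu"
```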

Regards,


Tony Graham
======================================================================
Tony Graham                            mailto:tgraham@mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9632
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================