[Python-Dev] Unicode howto in the works - feedback appreciated

Stephen J. Turnbull stephen@xemacs.org
01 May 2002 16:52:13 +0900


>>>>> "Skip" == Skip Montanaro <skip@pobox.com> writes:

    Skip> I began working on a Unicode HOWTO a few weeks ago, got a
    Skip> little ways on it, then ignored it until this morning.  I
    Skip> added a little bit more to it then decided I should get some
    Skip> feedback.  You can view it at

    Skip>     http://www.musi-cal.com/~skip/unicode/

Thanks!

A few comments.

Overall, I like this intro.  Technically it's horrible<wink> but I
think it will hit your target audience where they live.

[What Is Unicode?]

1.  Characters are "atomic units of text" that have properties.  Since
    they're atoms, we represent them by integers in computer programs.
    Among the properties are their glyphs (graphical representation),
    classes (alpha, num, whitespace, etc), and so on.  It is a bad
    idea to identify characters with their glyphs.

2.  Alphabets are abstract sets of characters.  Coded character sets
    map characters to integer representations.  "Encoding" is a
    reasonable synonym for "coded character set".  Avoid "charset"
    except when talking about the charset parameter of Content-Type.

3.  Typo in last sentence "I will suggest that YOU should use UTF-8."
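
A minimal sketch of points 1 and 2, assuming a modern Python 3
interpreter (not the Python of this thread) and the standard
unicodedata module.  The variable names are only illustrative.  The
character is an integer with properties attached, and the byte
sequence depends on which encoding you pick:

```python
import unicodedata

ch = "\u00e9"  # one character: LATIN SMALL LETTER E WITH ACUTE

# The character itself is identified by an integer (its code point).
code_point = ord(ch)

# Properties belong to the character, not to any particular glyph or encoding.
name = unicodedata.name(ch)
category = unicodedata.category(ch)  # general category, e.g. lowercase letter

# Different encodings map the same character to different byte sequences.
as_utf8 = ch.encode("utf-8")
as_latin1 = ch.encode("latin-1")

print(hex(code_point), name, category, as_utf8, as_latin1)
```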

[Why UTF-8?]

1.  Most programming languages are restricted to ASCII, except perhaps
    for user-defined identifiers.  This means that programming tools
    need only be 8-bit clean to handle UTF-8 without corruption.[1]

2.  Space efficiency is _not_ a general advantage of UTF-8 vs. UTF-16.
    For ASCII and most Western European languages, yes.  Greek, Hebrew,
    Arabic or Russian will be nearly a wash (whitespace, punctuation,
    and numerals give you what savings you're gonna get), and everybody
    east of Eden takes a 50% hit (3 bytes per character instead of 2).
    The real tradeoff is "string == array of fixed-width object"
    semantics[2] vs upward compatibility from ASCII for languages where
    most tokens contain only ASCII.
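
The space claim in point 2 is easy to check.  A rough sketch, assuming
Python 3; the sample strings and the encoded_sizes helper are made up
for illustration, and "utf-16-le" is used so that no byte order mark
is counted:

```python
samples = {
    "English": "Hello, world",
    "Russian": "Привет, мир",
    "Japanese": "こんにちは世界",
}


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for language, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{language:10} UTF-8: {utf8_size:3}  UTF-16: {utf16_size:3}")
```

With these particular strings, English halves in UTF-8, Russian comes
out close to even, and Japanese is half again as large as in UTF-16.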

[Email]

1.  If you don't get a Content-Type charset parameter, you _must_ assume
    US-ASCII.
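
One way this might look with the email package in a modern Python 3
standard library (an assumption; it is not the code the HOWTO uses).
The declared_charset helper is hypothetical; get_content_charset()
returns the failobj argument when no charset parameter is present:

```python
from email import message_from_string


def declared_charset(raw_message):
    """Return the charset of a message, falling back to US-ASCII."""
    msg = message_from_string(raw_message)
    # No Content-Type header, or no charset parameter: assume US-ASCII.
    return msg.get_content_charset(failobj="us-ascii")


no_header = "Subject: hi\n\nhello\n"
with_charset = "Content-Type: text/plain; charset=UTF-8\n\nhello\n"

print(declared_charset(no_header))
print(declared_charset(with_charset))
```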

[Mildly Corrupt Data]

1.  You can expect people to develop libraries for this kind of thing,
    but they are unlikely to be distributed.  Suggest that newbies ask
    around.

Footnotes: 
[1]  This isn't quite true; consider the Lisp ?A notation for
character literals.  A naive byte-oriented parser will pick up only
the leading byte of a non-ASCII UTF-8 character, and probably choke
fatally on the trailing bytes.  But Python, C, Java, et al. don't have
such literals---tokens with delimiters that are ASCII characters are
safe, both strings and identifiers.  You can ignore this issue.

[2]  Which UTF-16 actually doesn't give you!  Grrr.
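
Both footnotes can be sketched in a few lines, assuming Python 3.  The
three helper functions are hypothetical stand-ins for a naive
byte-oriented tool, not code from any real parser:

```python
def quoted_literal(source_bytes):
    """Return the bytes between the first pair of ASCII double quotes."""
    start = source_bytes.index(b'"') + 1
    end = source_bytes.index(b'"', start)
    return source_bytes[start:end]


def byte_after_question_mark(source_bytes):
    """Naively take the single byte that follows '?' (Lisp-style ?A)."""
    pos = source_bytes.index(b"?") + 1
    return source_bytes[pos:pos + 1]


def utf16_code_units(text):
    """Count the 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# [1] ASCII delimiters never occur inside a multi-byte UTF-8 sequence,
# so the delimited string comes through intact.
python_like = 'name = "日本語"'.encode("utf-8")
print(quoted_literal(python_like).decode("utf-8"))

# [1] The undelimited ?A form only picks up the leading byte, which is
# not valid UTF-8 on its own.
lisp_like = "?あ".encode("utf-8")
fragment = byte_after_question_mark(lisp_like)
try:
    fragment.decode("utf-8")
except UnicodeDecodeError as exc:
    print("broken character:", exc.reason)

# [2] UTF-16 is not fixed-width: characters outside the BMP take two
# code units (a surrogate pair).
print(utf16_code_units("A"), utf16_code_units("\U0001D11E"))
```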

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
 My nostalgia for Icon makes me forget about any of the bad things.  I don't
have much nostalgia for Perl, so its faults I remember.  Scott Gilbert c.l.py