OT [Way OT]: Unicode Unification Objections

Sun May 7 16:07:28 EDT 2000

"Dennis E. Hamilton" wrote:

> Consider the following.  In Japanese texts, when a borrowed or employed
> Korean word is used, a desired practice is to render the Korean
> characters as different, even though some or all of them involve "the
> same character" common to both languages.  However, the iconography (or
> calligraphy) is commonly different.  This loses the ability to
> distinguish the linguistic use of the character, forcing material to be
> font-distinguished some how (e.g., give me the ones that look Korean,
> not the ones that look Japanese).  This means that the distinction can't
> be preserved in simple text.

The entire point of markup is that distinctions like this *shouldn't* be
preserved in simple text.

How a particular character should *look* is a question for presentation
glyphs, not for character encoding.  What kind of language a character
is used to represent (emphasis, quotations, foreign borrowings, titles,
proper names, sarcasm, etc.) is a question for markup, not for character
encoding.
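
To make that split concrete, here is a minimal Python sketch.  The `Run`
type and its `lang` field are names I made up for illustration, not part
of any standard: the text itself carries only code points, and the
language of each run lives beside it as markup-style metadata that a
renderer could use to pick a glyph.

```python
from dataclasses import dataclass
import unicodedata


@dataclass
class Run:
    """A stretch of text plus a language tag (hypothetical markup model)."""
    lang: str  # illustrative tag such as "ja" or "ko"
    text: str  # plain code points, no presentation information


# U+65E5 is the unified ideograph for "sun" / "day".
SUN = "\u65e5"

# The same character used in a Japanese run and in a Korean run.
runs = [Run("ja", SUN), Run("ko", SUN)]

# The encoding sees one character; only the markup tells the runs apart.
same_code_point = runs[0].text == runs[1].text
different_markup = runs[0].lang != runs[1].lang
name = unicodedata.name(SUN)
```

A renderer that wanted Korean-looking or Japanese-looking glyphs would
consult `lang`, never the code point.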

A similar situation holds in almost every written language.  We often dump
Latin or French words into English text.  Even though they may be written
in the same alphabet, we usually want them to *look* different from
ordinary English words.  But nobody uses that fact to argue that Unicode
should have given italic "e" a different code number from plain "e" --
or even that entirely disjoint sets of codes should be used for
writing Latin, French, and English words.  (Actually, there probably were
people who argued for these, but they were rightly ignored.)

Once you start mixing up glyph questions (what a character should look
like) and markup questions (what kind of language is the character being
used for) with character encoding questions, there's no logical place to
stop.  Soon the Unicode Consortium would be fielding requests like
"I want to be able to distinguish the Yamato pronunciation of the
'sun' character whispered with a Hokkaido accent and a lisp from
the Yamato pronunciation of the 'sun' character as muttered inaudibly
by someone from Okinawa who has a head-cold.  And I want to
do it with a character that looks like it was finger-painted in drippy
chartreuse water-colours by a five-year-old.  And it should be 23-point
bold, 'cause it will be in a title.  I suggest you assign me character
xDE4A921B."

>
> Unfortunately, the Greek alphabet and the APL alphabet (and apparently
> some other math symbol alphabets) *were* unified.  That is, a number of
> Greek-letter symbols were removed from any distinct APL character set,
> and only some APL-unique made-up symbols having Greek letters in them
> were retained as separate.  Unfortunately, the iconography of the Greek
> alphabet in Greek text is often enough different that those codes don't
> render appropriately with the other APL symbols when used in APL texts.
> Borrowing epsilon (for member) from the Greek character set in Unicode
> is not always what one wants to do when writing membership propositions
> in APL (and borrowing the alternative MEMBER OF symbol may not get you
> what you want either).  It's even more fun if you want to write APL
> programs and use Greek-language identifiers.  Something a CP4E teacher
> in a Greek school might strongly desire to do.  Get it?
>

What kind of sadist would try to teach a CP4E course using APL?!   
(And I say this fondly as someone who learned APL as my first 
computer language in high school.)

Actually, this is a very good point. But the problem is more with APL 
than with Unicode.  Every computer language needs a way to distinguish
characters used in literals from characters used in operators from
characters used as raw data (e.g., strings), and so on.  Usually the 
decisions made during syntax design will limit the programmer's freedom
to make up literal names. (Or to quote what I said at 90 decibels at
2 a.m. two nights ago: "Why the **** can't I call that variable 'class'
if I want to call it 'class'!")  Iverson's decisions during APL's syntax
design must have made perfect sense in 1960, but in hindsight they 
turned out to be extremely inconvenient.  If you want to extend APL 
to make it less inconvenient for your Greek teacher, you're going to
have to either redesign the language syntax from scratch (as Iverson
did for J) or take the problematic APL characters and assign them
your own codes from Unicode's private use blocks.  That's what
the private use blocks are for -- they let you and the four other people
in the universe who are going to use your Greek APL interpreter do what
you need to do, without bogging down the entire character standard.
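
For what it's worth, the private-use route might look something like
this in Python.  This is only a sketch: the `APL_PRIVATE` table and the
particular code points in it are my own invention, since private-use
assignments are by definition a private agreement between you and your
interpreter's users.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """True if ch has general category 'Co' (private use)."""
    return unicodedata.category(ch) == "Co"


# Hypothetical private agreement: APL symbols get their own code points
# in the Basic Multilingual Plane private use area (U+E000..U+F8FF).
APL_PRIVATE = {
    "apl-epsilon": "\ue000",
    "apl-alpha": "\ue001",
}

# The standard characters stay free for Greek-language identifiers
# and for ordinary mathematical text.
GREEK_EPSILON = "\u03b5"  # GREEK SMALL LETTER EPSILON
ELEMENT_OF = "\u2208"     # ELEMENT OF

apl_symbols_are_private = all(is_private_use(c) for c in APL_PRIVATE.values())
greek_is_standard = not is_private_use(GREEK_EPSILON)
```

The interpreter and its fonts would then have to agree on what U+E000
means, because nothing outside that agreement will.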

-- Kevin Russell