[I18n-sig] Re: Unicode debate

Thu, 27 Apr 2000 16:50:53 -0700

> Christopher Petrilli [petrilli@amber.org] wrote:

>> Guido van Rossum [guido@python.org] wrote:
>> I've heard a few people claim that strings should always be considered
>> to contain "characters" and that there should be one character per
>> string element.  I've also heard a clamoring that there should only be
>> one string type.  You folks have never used Asian encodings.  In
>> countries like Japan, China and Korea, encodings are a fact of life,
>> and the most popular encodings are ASCII supersets that use a variable
>> number of bytes per character, just like UTF-8.  Each country or
>> language uses different encodings, even though their characters look
>> mostly the same to western eyes.  UTF-8 and Unicode are having a hard
>> time getting adopted in these countries because most software that
>> people use deals only with the local encodings.  (Sounds familiar?)

> Actually a bigger concern that we hear from our customers in Japan is
> that Unicode has *serious* problems in Asian languages.  They took
> the "unification" of Chinese and Japanese, rather than both, and
> therefore cannot represent lots of phrases quite right.  I can have
> someone write up a better description, but I was told by several
> Japanese people that they wouldn't use Unicode come hell or high
> water, basically.

Yeah, not all of the East Asian ideographs are available in Unicode atm. :(
Currently there are two pending extensions to the unified CJK ideographs.
Extension A is slated as part of the BMP. 0x0000 - 0xAAFF in Plane 2 is
currently slated for use by Extension B.  
BMP Roadmap: http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2213.pdf
Plane 2 Roadmap: http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2215.pdf

On top of which, there is the serious problem of end-user-defined
characters in a number of these MBCS encodings.

Win32 OSes handle mapping these characters into Unicode in the following
way:
In the Win32 registry at:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\EUDCCodeRange

There exist several REG_SZ registry values. The names of the values are
MBCS code pages.
The values are source ranges in the codepage's code space.
e.g.:
932: F040-F9FC
936: AAA1-AFFE,F8A1-FEFE,A140-A7A0
949: C9A1-C9FE,FEA1-FEFE
950: FA40-FEFE,8E40-A0FE,8140-8DFE,C6A1-C8FE
etc....

These ranges get mapped into Unicode code space starting at U+E000 (the
beginning of the BMP private use area).
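
A rough sketch of that mapping in Python, for code page 932 only. The
function names are made up for illustration, and the trail-byte layout
(0x40-0x7E and 0x80-0xFC, i.e. 188 positions per lead byte) is my
assumption about how the range is walked; the registry value itself only
gives the endpoints. I have not tried to cover the multi-range code pages,
since the order in which their ranges are assigned is not spelled out above.

```python
def parse_eudc_ranges(value):
    """Parse a registry value such as 'AAA1-AFFE,F8A1-FEFE' into (start, end) pairs."""
    ranges = []
    for part in value.split(","):
        start, end = part.strip().split("-")
        ranges.append((int(start, 16), int(end, 16)))
    return ranges


# Assumed valid trail bytes for code page 932: 0x40-0x7E and 0x80-0xFC.
CP932_TRAIL_BYTES = list(range(0x40, 0x7F)) + list(range(0x80, 0xFD))


def cp932_eudc_to_unicode(code, eudc_range=(0xF040, 0xF9FC), pua_base=0xE000):
    """Map a two-byte cp932 EUDC code to a private use code point.

    Codes are numbered sequentially from the start of the range, skipping
    byte values that are not valid trail bytes, then offset from U+E000.
    """
    start, end = eudc_range
    if not start <= code <= end:
        raise ValueError("code %04X is outside the EUDC range" % code)
    lead, trail = code >> 8, code & 0xFF
    if trail not in CP932_TRAIL_BYTES:
        raise ValueError("invalid trail byte %02X" % trail)
    index = (lead - (start >> 8)) * len(CP932_TRAIL_BYTES)
    index += CP932_TRAIL_BYTES.index(trail)
    return pua_base + index


start, end = parse_eudc_ranges("F040-F9FC")[0]
print("%04X -> U+%04X" % (start, cp932_eudc_to_unicode(start)))
print("%04X -> U+%04X" % (end, cp932_eudc_to_unicode(end)))
```

Under those assumptions the 932 range covers 10 lead bytes x 188 trail
bytes = 1880 code points, so F040 lands on U+E000 and F9FC on U+E757.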

> Basically it's JIS, Shift-JIS or nothing for most Japanese
> companies.  This was my experience working with Konica a few years ago 
> as well.

Don't forget the new JIS X 0213. :)

Bill