[I18n-sig] Re: Unicode debate
Bill Tutt
billtut@microsoft.com
Thu, 27 Apr 2000 16:50:53 -0700
> Christopher Petrilli [petrilli@amber.org] wrote:
>> Guido van Rossum [guido@python.org] wrote:
>> I've heard a few people claim that strings should always be considered
>> to contain "characters" and that there should be one character per
>> string element. I've also heard a clamoring that there should only be
>> one string type. You folks have never used Asian encodings. In
>> countries like Japan, China and Korea, encodings are a fact of life,
>> and the most popular encodings are ASCII supersets that use a variable
>> number of bytes per character, just like UTF-8. Each country or
>> language uses different encodings, even though their characters look
>> mostly the same to western eyes. UTF-8 and Unicode are having a hard
>> time getting adopted in these countries because most software that
>> people use deals only with the local encodings. (Sounds familiar?)
> Actually a bigger concern that we hear from our customers in Japan is
> that Unicode has *serious* problems in Asian languages. They took
> the "unification" of Chinese and Japanese, rather than both, and
> therefore cannot represent lots of phrases quite right. I can have
> someone write up a better description, but I was told by several
> Japanese people that they wouldn't use Unicode come hell or high
> water, basically.
Yeah, not all of the East Asian ideographs are available in Unicode atm. :(
Currently there are two pending extensions to the unified CJK ideographs.
Extension A is slated as part of the BMP. 0x0000 - 0xAAFF in Plane 2 is
currently slated for use by Extension B.
BMP Roadmap: http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2213.pdf
Plane 2 Roadmap: http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2215.pdf
On top of which, there is this serious problem of end user defined
characters in a number of these MBCS encodings.
Win32 OSes handle mapping these characters into Unicode in the following
way:
In the Win32 registry at:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\EUDCCodeRange
There exist several REG_SZ registry values. The names of the values are
MBCS code pages.
The values are source ranges in the codepage's code space.
e.g.:
932: F040-F9FC
936: AAA1-AFFE,F8A1-FEFE,A140-A7A0
949: C9A1-C9FE,FEA1-FEFE
950: FA40-FEFE,8E40-A0FE,8140-8DFE,C6A1-C8FE
etc....
These ranges get mapped into Unicode code space starting at U+E000 (the
beginning of the BMP private use area).
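
To make the mapping concrete, here is a rough Python sketch of how such a
range string could be turned into private use code points. It is only an
illustration under stated assumptions, not a description of what Win32
actually does internally: the function names are made up for this example,
and it assumes that each range is a rectangle of lead bytes by trail bytes,
that 0x7F is never a valid trail byte, and that the ranges are consumed in
the order they are listed in the registry value.

```python
PUA_START = 0xE000  # beginning of the BMP private use area


def parse_ranges(value):
    """Parse a value such as 'C9A1-C9FE,FEA1-FEFE' into (start, end) pairs."""
    ranges = []
    for part in value.split(","):
        start, end = part.strip().split("-")
        ranges.append((int(start, 16), int(end, 16)))
    return ranges


def trail_bytes(lo, hi):
    # ASSUMPTION: 0x7F is skipped because it is not used as a trail byte.
    return [b for b in range(lo, hi + 1) if b != 0x7F]


def eudc_to_unicode(code, ranges):
    """Map a double-byte EUDC code to a code point, or None if out of range.

    ASSUMPTION: code points are handed out linearly, range after range,
    in the order the ranges appear in the registry value.
    """
    lead, trail = code >> 8, code & 0xFF
    offset = 0
    for start, end in ranges:
        lead_lo, lead_hi = start >> 8, end >> 8
        trails = trail_bytes(start & 0xFF, end & 0xFF)
        if lead_lo <= lead <= lead_hi and trail in trails:
            row = (lead - lead_lo) * len(trails)
            return PUA_START + offset + row + trails.index(trail)
        # Not in this range: skip past every code it would have consumed.
        offset += (lead_hi - lead_lo + 1) * len(trails)
    return None


if __name__ == "__main__":
    cp932 = parse_ranges("F040-F9FC")
    print(hex(eudc_to_unicode(0xF040, cp932)))
    print(hex(eudc_to_unicode(0xF9FC, cp932)))
```

Under these assumptions the 932 range F040-F9FC covers 10 lead bytes times
188 trail bytes, so the sketch places its 1880 codes at U+E000 through
U+E757. For 949, the second range (FEA1-FEFE) starts right after the 94
codes of the first one.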
> Basically it's JIS, Shift-JIS or nothing for most Japanese
> companies. This was my experience working with Konica a few years ago
> as well.
Don't forget the new JIS X 0213. :)
Bill