[I18n-sig] Re: CJKCodecs 0.9 is released

Tom Emerson tree@basistech.com
Sat, 21 Jun 2003 13:21:16 -0400


Martin v. Löwis writes:
[...]
> > Quoting Ken Lunde's CJKV Information Processing p.206 table 4-66:
> > ] Table 4-66: Shift-JIS to Unicode and EUC-JP for User-Defined Region
> > ]
> > ] Shift-JIS     Unicode     EUC-JP
> > ] F040-F0FC     E000-E0BB   F5A1-F5FE, F6A1-F6FE
> > ] F140-F1FC     E0BC-E177   F7A1-F7FE, F8A1-F8FE
> > ] F240-F2FC     E178-E233   F9A1-F9FE, FAA1-FAFE
> > ] --snip--
> > ] F940-F9FC     E69C-E757   8FFDA1-8FFDFE, 8FFEA1-8FFEFE
>
> Is this really necessary? Using PUA characters is evil, IMO, and
> should be avoided unless explicitly requested by the application.  If
> those characters are not supported in Unicode, they can't be really
> important, no?

Yes, it is really necessary.

If you want to round trip these encodings then you need to map the
UDRs of the various legacy encodings into the PUA and back again. If
you don't then you can and will lose data.
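
As a rough illustration of the round trip, here is a minimal Python
sketch for the Shift-JIS rows of Table 4-66 quoted above. It assumes the
mapping is linear across lead bytes F0-F9, with 188 cells per row (trail
bytes 40-7E and 80-FC), which is what the quoted rows imply; the snipped
rows are not checked here. The function names are made up for this
example and are not part of CJKCodecs.

```python
PUA_BASE = 0xE000      # first PUA code point used by the table
CELLS_PER_ROW = 188    # trail bytes 0x40-0x7E (63) plus 0x80-0xFC (125)
FIRST_LEAD, LAST_LEAD = 0xF0, 0xF9   # Shift-JIS user-defined lead bytes


def sjis_udr_to_pua(lead, trail):
    """Map a Shift-JIS UDR byte pair to a PUA code point (assumed linear)."""
    if not FIRST_LEAD <= lead <= LAST_LEAD:
        raise ValueError("lead byte outside the user-defined region")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
        raise ValueError("invalid Shift-JIS trail byte")
    # 0x7F is never a trail byte, so cells after it shift down by one.
    cell = trail - 0x40 if trail < 0x7F else trail - 0x41
    return PUA_BASE + (lead - FIRST_LEAD) * CELLS_PER_ROW + cell


def pua_to_sjis_udr(code_point):
    """Inverse of sjis_udr_to_pua: PUA code point back to (lead, trail)."""
    offset = code_point - PUA_BASE
    rows = LAST_LEAD - FIRST_LEAD + 1
    if not 0 <= offset < rows * CELLS_PER_ROW:
        raise ValueError("code point outside the mapped PUA range")
    row, cell = divmod(offset, CELLS_PER_ROW)
    trail = cell + 0x40 if cell < 0x3F else cell + 0x41
    return FIRST_LEAD + row, trail
```

With this, F040 maps to U+E000 and F9FC to U+E757, matching the table
endpoints, and every UDR byte pair survives a trip through Unicode and
back.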

For Japanese encodings there are numerous corporate extensions to
Shift JIS, as well as the various emoticons and other dingbats
introduced for use with iMode and other phones.

It is a much bigger issue for the Chinese encodings: extensions to Big
Five (CP950, ETen, GCCS, HKSCS, etc.) are done in the UDR and VDR
parts of the encoding space. Unfortunately you rarely if ever see such
documents identified as ETen or CP950 or HKSCS: just as Big
Five. Since you cannot easily detect which of these variants are in use
you need to round trip the UDRs/VDRs through the PUA.

> Or, are you sure that they are still unsupported in Unicode 4.0?

In the case of HKSCS all but 3 characters are defined in Planes 0 and
2. However, as I mentioned above, if you do not know that your file
claiming Big Five is really HKSCS then you can't map the UDR/VDR
sections appropriately.

Oh, and Microsoft defines CP950 as different things depending on
whether the file is from Taiwan or Hong Kong.

The latest issue faced with transcoding between legacy Asian encodings
(especially JIS X 0213) and Unicode is the interpretation of
compatibility characters and how strictly you want to enforce the
rules laid out by TUC.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"