[I18n-sig] Re: CJKCodecs 0.9 is released

Martin v. Löwis martin@v.loewis.de
21 Jun 2003 21:26:53 +0200


Tom Emerson <tree@basistech.com> writes:

> For Japanese encodings there are numerous corporate extensions to
> Shift JIS, as well as the various emoticons and other dingbats
> introduced for use with iMode and other phones. 

That only tells me that mapping to the PUA is most likely incorrect.
Two questions, though:

Are these corporate extensions well-specified? Are they
non-overlapping?

If yes, I think a "proper" mapping should be found. For example, many
emoticons and dingbats are supported in Unicode 4.0, and should be
used instead of the PUA.

If no, I feel that these characters just shouldn't round-trip. There
would be no loss of data. Instead, users would get a UnicodeError,
indicating that some characters just can't be converted to Unicode.
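
For example (a minimal sketch; 0x80 is just an assumed byte with no
assignment in the plain Shift JIS codec):

    # strict decoding fails loudly instead of guessing
    b"\x80".decode("shift_jis")        # raises UnicodeDecodeError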

Now, there might be certain applications where this is not
acceptable. For many of these applications, it is the runtime error
that is not acceptable, not a possible loss of data in rare cases. For
these cases, the 'replace' processing of Python codecs seems
appropriate.
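
E.g. (same assumed byte as above):

    # 'replace' substitutes U+FFFD for unmappable input, trading a
    # small, visible data loss for freedom from runtime errors
    b"\x80".decode("shift_jis", errors="replace")   # -> '\ufffd'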

For a small number of applications, round-tripping is important
enough to justify using the PUA. It is important that authors of
these applications understand that they can *only* convert the
results back to the original encoding, and not to some other encoding
- e.g. it is incorrect to encode the Unicode strings as UTF-8 for use
in HTML.

Authors of these applications would need to indicate that they
understand all that, e.g. by using a different codec name (e.g. one
with a '+pua' suffix).
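
A sketch of how such an opt-in could look (the name 'big5+pua' and
the reuse of the plain big5 codec are placeholders; a real
implementation would supply its own mapping for the extension
ranges):

    import codecs

    def _search(name):
        # expose the PUA-round-tripping variant only under an
        # explicit name, so applications must opt in knowingly
        if name == "big5+pua":
            # placeholder: a real codec would map the UDR/VDR byte
            # ranges to PUA code points instead of rejecting them
            return codecs.lookup("big5")
        return None

    codecs.register(_search)
    text = b"\xa4\xa4\xa4\xe5".decode("big5+pua")   # opt in by name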

> It is a much bigger issue for the Chinese encodings: extensions to Big
> Five (CP950, ETen, GCCS, HKSCS, etc.) are done in the UDR and VDR
> parts of the encoding space. Unfortunately you rarely if ever see such
> documents identified as ETen or CP950 or HKSCS: just as Big
> Five. Since you cannot easily detect which of these variants are in use
> you need to round trip the UDRs/VDRs through the PUA.

Again, assuming it is round-tripping that you are after. Many Python
Unicode applications don't do round-tripping. Instead, they convert
the input to some other encoding (put it into a database, output
UTF-8, output XML character references). With PUA mappings in the
data, that is a perfect recipe for moji-bake.
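
For instance (U+E000 below merely stands in for whatever PUA code
point a codec might assign to an extension character):

    # once an extension character is parked in the PUA, re-encoding
    # it as UTF-8 yields well-formed but meaningless bytes; the
    # identity of the original glyph is gone
    "\ue000".encode("utf-8")           # -> b'\xee\x80\x80'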

> In the case of HKSCS all but 3 characters are defined in Planes 0 and
> 2. However, as I mentioned above, if you do not know that your file
> claiming Big Five is really HKSCS then you can't map the UDR/VDR
> sections appropriately.

Can you give an example where using the HKSCS codec for decoding would
be incorrect?

> Oh, and Microsoft defines CP950 as different things depending on
> whether the file is from Taiwan or Hong Kong.

That sounds like one needs two versions of cp950...

In any case, for MS code pages, I think a Python codec should do
exactly what MS does. If that involves the PUA, oh well, at least the
moji-bake will be consistent with what Microsoft produces, so MSIE
might even render it correctly.

Regards,
Martin