[I18n-sig] Re: CJKCodecs 0.9 is released

Sat, 21 Jun 2003 15:46:43 -0400

Martin v. Löwis writes:
> Tom Emerson <tree@basistech.com> writes:
> > For Japanese encodings there are numerous corporate extensions to
> > Shift JIS, as well as the various emoticons and other dingbats
> introduced for use with iMode and other phones.
>
> That only tells me that mapping to the PUA is most likely incorrect,
> though:
>
> Are these corporate extensions well-specified? Are they
> non-overlapping?

Well specified? Sure, there are specifications.

Non-overlapping? Of course not: each corporate extension starts at the
same point in the user-defined regions of the legacy encoding.

> If yes, I think a "proper" mapping should be found. For example, many
> emoticons and dingbats are supported in Unicode 4.0, and should be
> used instead of the PUA.

Absolutely you should, but these characters have to be proposed to the
Unicode Consortium, and in the case of ideographs, to the IRG of ISO
10646.

> If no, I feel that these characters just shouldn't round-trip. There
> would be no loss of data. Instead, users would get a UnicodeError,
> indicating that some characters just can't be converted to Unicode.

This is a ridiculously pedantic approach that will end up pissing
people off: the PUA in Unicode is designed for this purpose, so it
should be used.
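
To make the PUA argument concrete, here is a minimal sketch of a
round-trippable mapping from a user-defined double-byte region into the
PUA. The constants follow the linear layout commonly documented for code
page 932 (lead bytes 0xF0-0xF9, 188 trail bytes each, starting at U+E000);
treat that layout, and the function names, as assumptions for illustration
and not as what any particular codec does.

```python
PUA_BASE = 0xE000          # first code point of the BMP Private Use Area
TRAILS_PER_LEAD = 188      # trail bytes 0x40-0x7E plus 0x80-0xFC
LEAD_FIRST, LEAD_LAST = 0xF0, 0xF9  # assumed user-defined lead bytes


def udr_to_pua(lead: int, trail: int) -> str:
    """Map one user-defined double-byte code to a PUA character."""
    if not LEAD_FIRST <= lead <= LEAD_LAST:
        raise ValueError("lead byte outside the user-defined range")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
        raise ValueError("invalid trail byte")
    # 0x7F is skipped, so trail bytes above it shift down by one.
    offset = trail - 0x40 if trail < 0x80 else trail - 0x41
    return chr(PUA_BASE + TRAILS_PER_LEAD * (lead - LEAD_FIRST) + offset)


def pua_to_udr(ch: str) -> bytes:
    """Inverse of udr_to_pua: recover the original two bytes."""
    index = ord(ch) - PUA_BASE
    lead_count = LEAD_LAST - LEAD_FIRST + 1
    if not 0 <= index < lead_count * TRAILS_PER_LEAD:
        raise ValueError("character outside the mapped PUA range")
    lead, offset = divmod(index, TRAILS_PER_LEAD)
    trail = offset + 0x40 if offset < 0x3F else offset + 0x41
    return bytes([LEAD_FIRST + lead, trail])
```

Nothing is lost in either direction, which is the point: the bytes come
back exactly, whatever glyph a given vendor assigned to them.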

> Now, there might be certain applications where this is not
> acceptable. For many of these applications, it is the runtime error
> that is not acceptable, not a possible loss of data in rare cases. For
> these cases, the 'replace' processing of Python codecs seems
> appropriate.

Data loss is a problem. Customers get very upset when their data gets
munged for no good reason.
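
A small sketch of why 'replace' counts as data loss. It uses the ASCII
codec only because its behavior is easy to verify; the helper name
decode_report is made up for this example.

```python
def decode_report(data: bytes, encoding: str = "ascii") -> dict:
    """Decode the same bytes under three error policies."""
    report = {}
    try:
        report["strict"] = data.decode(encoding, "strict")
    except UnicodeDecodeError as exc:
        # 'strict' refuses to guess and reports where it stopped.
        report["strict"] = f"error at byte {exc.start}"
    # 'replace' substitutes U+FFFD for every undecodable byte.
    report["replace"] = data.decode(encoding, "replace")
    # 'ignore' silently drops undecodable bytes.
    report["ignore"] = data.decode(encoding, "ignore")
    return report


first = decode_report(b"price: \x80 5")
second = decode_report(b"price: \x81 5")

# Two different inputs collapse to the same text, so the original
# bytes cannot be recovered from the decoded string.
same_after_replace = first["replace"] == second["replace"]
```

With 'strict' the caller at least learns something went wrong; with
'replace' or 'ignore' the distinction between the two inputs is gone.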

> For a small number of applications, round-tripping is important enough
> even if it means to use the PUA. It is important that authors of these
> applications understand that they can *only* convert back the results
> to the original encoding, and not to some other encoding - e.g. it is
> incorrect to encode the Unicode strings as UTF-8, for use in HTML.

Where does it say you cannot encode PUA characters in UTF-8? If
you have a custom font that handles these code points, then you are
going to be upset that you can't display them because the author of
the codec decided that PUA characters are an abomination that should
be stricken from the earth.
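
For what it's worth, PUA code points are ordinary scalar values as far
as UTF-8 is concerned. A quick sketch (the helper name describe_pua is
invented here, and whether a given font or browser renders the result
is a separate question):

```python
import unicodedata


def describe_pua(ch: str) -> dict:
    """Show how one PUA character survives common output encodings."""
    utf8 = ch.encode("utf-8")
    return {
        # 'Co' is the Unicode general category for private-use characters.
        "category": unicodedata.category(ch),
        "utf8": utf8,
        "utf8_round_trips": utf8.decode("utf-8") == ch,
        # Numeric character reference, as used in HTML or XML output.
        "charref": ch.encode("ascii", "xmlcharrefreplace"),
    }


info = describe_pua("\ue000")
```

Here info["utf8"] is b"\xee\x80\x80" and info["charref"] is b"&#57344;".
The encoding step never fails; what the receiving side does with the
code point is up to whoever agreed on its meaning.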

> Authors of these applications would need to specify that they
> understand all that, e.g. by using a different codec name (e.g. a
> '+pua' suffix)

So then you get a pile of ShiftJIS encodings: those that round trip
and those that don't.

> Again, assuming it is round-tripping that you are after. Many Python
> Unicode applications don't do round-tripping. Instead, they convert
> the input to some other encoding (put it into a database, output
> UTF-8, output XML character references). This is a perfect recipe for
> moji-bake.

I disagree that this is a recipe for moji-bake. If I'm stuffing values
into a database, PUA may be the only thing we can do. I do not want my
ShiftJIS extension characters being replaced with U+FFFD.

> > In the case of HKSCS all but 3 characters are defined in Planes 0 and
> > 2. However, as I mentioned above, if you do not know that your file
> > claiming Big Five is really HKSCS then you can't map the UDR/VDR
> > sections appropriately.
>
> Can you give an example where using the HKSCS codec for decoding would
> be incorrect?

I can dig up the three characters that are not encoded in Unicode: I
don't have the latest HKSCS at home. But again, if you do not know you
are looking at HKSCS, you lose.
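
The labeling problem can be sketched as follows: decode the same bytes
under each candidate label and see which ones even succeed. The helper
name try_decodings is made up for this example, and a successful decode
does not prove the label is right, only that the bytes are valid under
that codec.

```python
def try_decodings(data: bytes, encodings) -> dict:
    """Decode data under each named codec; None marks a failed decode."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Demonstrated with codecs whose behavior is easy to verify by hand.
demo = try_decodings(b"\xe9", ["ascii", "latin-1", "utf-8"])
```

For the case discussed here one would pass something like
["big5", "big5hkscs", "cp950"] (all codec names that ship with current
Python) along with the bytes of the file in question, and compare the
results character by character.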

> > Oh, and Microsoft defines CP950 as different things depending on
> > whether the file is from Taiwan or Hong Kong.
>
> That sounds like one needs two versions of cp950...

Sure, if you know which version you are dealing with, which you may not.

> In any case, for MS code pages, I think a Python codec should do
> exactly what MS does. If that involves PUA, oh well, at least the
> moji-bake will be consistent with what Microsoft produces, so MSIE
> might even render it correctly..

Yes, well, it can be a fulltime job to keep up to date with
Microsoft's ever-changing mapping tables.

Peace,

   tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"