[I18n-sig] Re: CJKCodecs 0.9 is released

Martin v. Löwis martin@v.loewis.de
21 Jun 2003 23:16:22 +0200


Tom Emerson <tree@basistech.com> writes:

> This is a ridiculously pedantic approach that will end up pissing
> people off: the PUA in Unicode is designed for this purpose, so it
> should be used.

It is fine if users are aware that this happens. If they are not, they
will be pissed off when they find out.

> Where does it say you cannot encode PUA characters in UTF-8? If
> you have a custom font that handles these code points, then you are
> going to be upset that you can't display them because the author of
> the codec decided that PUA characters are an abomination that should
> be stricken from the earth.

And if you don't have such a font, you will see some replacement
characters.

A lot of things need to be in place for this to work correctly.
Developers need to make sure all of them are in place, and need to
ask the libraries to behave the way their setup expects.

> I disagree that this is a recipe for moji-bake. If I'm stuffing values
> into a database PUA may be the only thing we can do. I do not want my
> ShiftJIS extension characters being replaced with U+FFFD.

Now, if your font was meant for a different proprietary extension that
happens to use the same private characters, you get incorrect
display. Right? Likewise, if some other application reads out the
data, and interprets the private characters in a different way.
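
A minimal sketch of how an application could notice private-use
characters before text is handed to another font or program. The
function names find_private_use and export_text are hypothetical, and
refusing the export is only one possible policy; the check itself
relies on the Unicode general category "Co" (private use).

```python
import unicodedata


def find_private_use(text):
    """Return (index, code point) pairs for private-use characters in text."""
    # General category "Co" covers the BMP private use area
    # (U+E000..U+F8FF) and the private-use planes 15 and 16.
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


def export_text(text):
    """Encode text for the outside world, refusing private-use characters."""
    leaks = find_private_use(text)
    if leaks:
        # Hypothetical policy: fail loudly instead of leaking silently.
        raise ValueError(f"private-use characters would leak: {leaks}")
    return text.encode("utf-8")
```

An application that does want to exchange such characters would
replace the ValueError with whatever agreement it has with the
receiving side.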

Private characters should never leave the scope of "the application",
and some effort should be made to make sure they don't leak out of
"the application".

> > Can you give an example where using the HKSCS codec for decoding would
> > be incorrect?
> 
> I can dig up the three characters that are not encoded in Unicode: I
> don't have the latest HKSCS at home. But again, if you do not know you
> are looking at HKSCS, you lose.

This is not what I meant. What I'm asking is this: Are there HKSCS
characters that have encodings which are identical to encodings in
other common Big-5 extensions?

IOW, what bad things would happen if you assumed all Big-5 is
HKSCS? Or: how would the use of PUAs improve the situation in that
case?
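
One way to explore that question empirically is to decode the same
bytes under several Big-5 variants and see where they disagree. This
is only a sketch: it assumes a current Python whose standard library
provides the codec names "big5", "big5hkscs" and "cp950", and the
helper names decode_with and agree are made up for illustration.

```python
def decode_with(data, codecs_to_try=("big5", "big5hkscs", "cp950")):
    """Decode the same bytes under several Big-5 variants.

    Undecodable sequences become U+FFFD, so disagreements between
    the variants stay visible in the result instead of raising.
    """
    return {name: data.decode(name, errors="replace") for name in codecs_to_try}


def agree(data):
    """Return True if every codec yields the same text for these bytes."""
    return len(set(decode_with(data).values())) == 1
```

Byte sequences for which agree() returns False are the ones where
assuming the wrong variant would change the text.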

> > That sounds like one needs two versions of cp950...
> 
> Sure, if you know which version you are dealing with, which you may not.

That is always the case: If I don't know the encoding of some
document, there is always the risk of misinterpretation. I can use
heuristics to guess the encoding in some cases, and in some cases, the
heuristics work reasonably well - in other cases, they fail miserably.

There is nothing one can do, except to have users always declare their
encodings properly, to use only data formats which include charset
declarations, to use only charset names that are unambiguous,
preferably even over time, etc. If people don't follow these rules,
some things will go wrong. Then, people will learn to correct their
errors.
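
As an illustration of relying on declarations rather than guessing,
here is a small sketch that decodes a body strictly according to the
charset named in a MIME Content-Type value and refuses to proceed when
none is declared. The function name decode_declared is hypothetical;
the header parsing uses the standard library's email.message.Message.

```python
from email.message import Message


def decode_declared(body, content_type):
    """Decode bytes using only the charset declared in a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the lower-cased charset parameter, or None if it is absent.
    charset = msg.get_content_charset()
    if charset is None:
        raise ValueError("no charset declared; refusing to guess")
    # Strict decoding: a wrong declaration surfaces as an error
    # instead of silently producing replacement characters.
    return body.decode(charset, errors="strict")
```

A declaration that names a charset Python does not know raises
LookupError, which is also a visible failure rather than a silent
misreading.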

Regards,
Martin