[I18n-sig] Re: CJKCodecs 0.9 is released
Martin v. Löwis
martin@v.loewis.de
11 Jun 2003 22:34:15 +0200
Hye-Shik Chang <perky@fallin.lv> writes:
> Legend:
> CJK - CJKCodecs 0.9
> Chinese - ChineseCodecs 1.2.0
> Japanese - JapaneseCodecs 1.4.9
> Korean - KoreanCodecs 2.0.5
> GNU - GNU libiconv 1.8 + iconvcodecs 1.0
Very interesting, again. I have some problems interpreting the data,
though.
> 2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
>
>         CJK       Chinese   GNU
> a15a    -         fffd      -
> a1c3    -         fffd      -
> a1c5    -         fffd      -
> a1fe    -         fffd      -
> a240    -         fffd      -
> a2cc    -         fffd      -
> a2ce    -         fffd      -
What does that mean? CJK and iconv give UnicodeError, whereas
ChineseCodecs puts in the replacement character? That looks like a bug
in ChineseCodecs to me. The replacement character should only
be generated if errors='replace', no?
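To make the expected behaviour concrete, here is a minimal sketch of the
check I have in mind. It assumes a Python whose codec registry knows an
encoding named "big5" (an assumption; substitute whichever codec is under
test), and the helper name decode_both_ways is made up for illustration.

```python
def decode_both_ways(raw, encoding):
    """Return (strict_result, replace_result) for the given bytes.

    strict_result is None when the codec raises UnicodeError in strict
    mode; replace_result is whatever errors='replace' produces.
    """
    try:
        strict = raw.decode(encoding, "strict")
    except UnicodeError:
        strict = None
    replaced = raw.decode(encoding, "replace")
    return strict, replaced


if __name__ == "__main__":
    # a15a is the first disputed Big5 sequence from the quoted table.
    # "big5" is an assumed codec name; no particular result is claimed.
    strict, replaced = decode_both_ways(b"\xa1\x5a", "big5")
    print("strict :", None if strict is None else ascii(strict))
    print("replace:", ascii(replaced))
```

A codec that returns U+FFFD in the strict column for an unmapped sequence
is doing what I would call a bug.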
> 3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
>
>         CJK       Japanese  GNU
> 01c0    005c      ff3c      ff3c
That appears to be a bug in CJK, right? The question is whether
\xa1\xc0 is REVERSE SOLIDUS or FULLWIDTH REVERSE SOLIDUS. Now, it
appears that euc-jp also supports \x5c, mapped to REVERSE SOLIDUS,
so \xa1\xc0 should be interpreted as FULLWIDTH REVERSE SOLIDUS,
no?
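A quick way to see what a given codec does with the two sequences is to
print the Unicode names of the decoded characters. This is only a sketch:
the codec name "euc_jp" is an assumption about what is installed, and
describe is a hypothetical helper, not part of any of the packages.

```python
import unicodedata


def describe(raw, encoding):
    """Decode raw and return [(hex code point, Unicode name), ...]."""
    text = raw.decode(encoding)
    return [("%04x" % ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text]


if __name__ == "__main__":
    # The single byte 5c and the two-byte sequence a1 c0 discussed above.
    for raw in (b"\x5c", b"\xa1\xc0"):
        try:
            print(raw.hex(), describe(raw, "euc_jp"))
        except UnicodeError as exc:
            print(raw.hex(), "undecodable:", exc)
```

If both lines report REVERSE SOLIDUS, the codec cannot round-trip the two
characters, which is the problem I am worried about.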
In case of doubt, I think ICU should be consulted as a reference as
well, and some kind of majority followed. In any case, I think the
questionable mappings need to be documented.
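By "some kind of majority" I mean something as simple as the following
sketch. The function name and the dictionary layout are my own invention;
the sample votes are the three columns of the a1c0 row quoted above, and
an ICU column would simply be one more entry once somebody has looked it
up.

```python
from collections import Counter


def majority_mapping(votes):
    """Pick the mapping most implementations agree on.

    votes maps an implementation name to a code point string, or to
    None if that implementation has no mapping. Returns
    (winner, unanimous); winner is None when no value is backed by a
    strict majority of the implementations that define a mapping.
    """
    defined = [v for v in votes.values() if v is not None]
    if not defined:
        return None, True
    winner, count = Counter(defined).most_common(1)[0]
    unanimous = count == len(votes)
    if count * 2 <= len(defined):
        return None, False
    return winner, unanimous


if __name__ == "__main__":
    votes = {"CJK": "005c", "Japanese": "ff3c", "GNU": "ff3c"}
    print(majority_mapping(votes))
```

Every row where the second element is False is a candidate for the
documentation.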
> f5a1    e000      -         e000    -+ User-Defined Area
> f5a2    e001      -         e001     |
> ....                                 |
> fefd    e3aa      -         e3aa     |
> fefe    e3ab      -         e3ab    -+
What are these? I cannot find them in glibc.
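Looking at the four rows that are shown, they are at least consistent
with a plain linear mapping of rows f5..fe (94 cells each) onto the
Unicode Private Use Area starting at U+E000. That is an inference from
the quoted numbers only, not something I have found documented; the
function below is a sketch of that assumption.

```python
import unicodedata


def euc_jp_user_defined(lead, trail):
    """Map an EUC-JP pair in rows f5..fe to a Private Use code point.

    Assumes 94 cells per row, laid out linearly from U+E000. This
    matches the four quoted table rows but is otherwise a guess.
    """
    if not (0xF5 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
        raise ValueError("not in the f5a1..fefe range")
    return 0xE000 + (lead - 0xF5) * 94 + (trail - 0xA1)


if __name__ == "__main__":
    for lead, trail in ((0xF5, 0xA1), (0xF5, 0xA2), (0xFE, 0xFD), (0xFE, 0xFE)):
        cp = euc_jp_user_defined(lead, trail)
        # General category "Co" means Private Use.
        print("%02x%02x -> %04x %s" % (lead, trail, cp,
                                       unicodedata.category(chr(cp))))
```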
> 4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis
>
>         CJK       Japanese  GNU
> 005c    00a5      005c      00a5
Here, I would trust GNU iconv; 5C really is YEN SIGN.
> 007e    203e      007e      203e
Likewise for OVERLINE - is there really no TILDE in shift-jis?
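To compare how several codecs treat these single bytes, something like
the following sketch would do. The codec names in CANDIDATES are
assumptions about what is installed, and single_byte_survey is a
hypothetical helper; codecs that are missing or that reject a byte just
show up as None.

```python
import unicodedata

# Codec names assumed to be registered in the Python being tested.
CANDIDATES = ("shift_jis", "cp932", "euc_jp")


def single_byte_survey(byte_values, encodings=CANDIDATES):
    """Return {encoding: {byte: Unicode name or None}} for single bytes."""
    survey = {}
    for enc in encodings:
        row = {}
        for b in byte_values:
            try:
                ch = bytes([b]).decode(enc)
            except (UnicodeError, LookupError):
                # Undecodable byte, or codec not installed.
                row[b] = None
                continue
            row[b] = unicodedata.name(ch, "<unnamed>") if len(ch) == 1 else None
        survey[enc] = row
    return survey


if __name__ == "__main__":
    for enc, row in single_byte_survey([0x5C, 0x7E, 0x7F]).items():
        for b, name in row.items():
            print("%-10s %02x %s" % (enc, b, name))
```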
> 007f    -         007f      007f
Why that?
> 815f    005c      ff3c      ff3c
Again: Why that? Shouldn't \x81\x5f be FULLWIDTH REVERSE SOLIDUS?
> 817f    -         00d7      -
What character is that? Why does JapaneseCodecs map it to
MULTIPLICATION SIGN? glibc seems to map that to \x81\x7e;
is that a typo in JapaneseCodecs?
> 837f    -         30df      -
Are you sure glibc does not support that? Seems to be
KATAKANA LETTER MI.
> 9e7f    -         684e      -
Why does JapaneseCodecs do that? glibc maps 9e7e to 684e.
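What the three rows 817f, 837f and 9e7f have in common is the trail
byte 7f. As far as I know, Shift_JIS trail bytes run from 40 to 7e and
from 80 to fc, so 7f should not occur as a second byte at all. A small
sketch for flagging such rows in a comparison table (function names are
mine, for illustration only):

```python
def is_shift_jis_trail(byte):
    """True if byte lies in the usual Shift_JIS trail ranges 40-7e, 80-fc."""
    return 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC


def suspicious_rows(codes):
    """Return the two-byte codes (as ints) whose trail byte is out of range."""
    return [code for code in codes if not is_shift_jis_trail(code & 0xFF)]


if __name__ == "__main__":
    table_codes = [0x815F, 0x817F, 0x837F, 0x9E7F]
    print(["%04x" % c for c in suspicious_rows(table_codes)])
```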
> f040    e000      -         e000    -+ User-Defined Area
> f041    e001      -         e001     |
> ....                                 |
> f9fb    e756      -         e756     |
> f9fc    e757      -         e757    -+
Again: What are these characters?
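As with euc-jp, the rows shown are consistent with a linear mapping onto
the Private Use Area: ten lead bytes f0..f9 with 188 trail positions each
(40-7e and 80-fc) give exactly the 1880 code points e000..e757. Again,
this is only what I infer from the quoted numbers; the function is a
sketch of that assumption, not a description of any codec's source.

```python
def shift_jis_user_defined(lead, trail):
    """Map a Shift_JIS pair with lead f0..f9 to a Private Use code point.

    Assumes 188 cells per lead byte (trail 40-7e, then 80-fc), laid out
    linearly from U+E000, which matches the quoted table rows.
    """
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError("lead byte outside f0..f9")
    if 0x40 <= trail <= 0x7E:
        cell = trail - 0x40
    elif 0x80 <= trail <= 0xFC:
        cell = trail - 0x41  # skip the unused trail byte 7f
    else:
        raise ValueError("invalid trail byte")
    return 0xE000 + (lead - 0xF0) * 188 + cell


if __name__ == "__main__":
    for lead, trail in ((0xF0, 0x40), (0xF0, 0x41), (0xF9, 0xFB), (0xF9, 0xFC)):
        print("%02x%02x -> %04x" % (lead, trail,
                                    shift_jis_user_defined(lead, trail)))
```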
> 5) CJKCodecs' cp932 versus JapaneseCodecs' ms932
>
>         CJK       Japanese  GNU
> 8160    ff5e      ff5e      301c
> 8161    2225      2225      2016
Are these verified against MS CP932, e.g. from Windows XP?
> 817f    -         00d7      -
Likewise: For CP932, it seems essential to do whatever Microsoft does,
in any Windows version.
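A verification along these lines could be scripted roughly as follows.
The sketch compares two codecs over a list of byte sequences and reports
only the differences; the codec names "cp932" and "shift_jis" are
assumptions about what is installed, and the Microsoft reference data
itself would have to come from an actual Windows system, which this
sketch does not provide.

```python
def decode_or_none(raw, encoding):
    """Strict decode; None if the codec rejects the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeError:
        return None


def diff_codecs(samples, enc_a, enc_b):
    """Return [(raw, result_a, result_b)] for samples the codecs disagree on."""
    rows = []
    for raw in samples:
        a = decode_or_none(raw, enc_a)
        b = decode_or_none(raw, enc_b)
        if a != b:
            rows.append((raw, a, b))
    return rows


if __name__ == "__main__":
    samples = [b"\x81\x60", b"\x81\x61", b"\x81\x7f"]
    for raw, a, b in diff_codecs(samples, "cp932", "shift_jis"):
        print(raw.hex(), ascii(a), ascii(b))
```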
> 2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
>
>         CJK       Chinese   GNU
> fffd    a2ce      a2ce      -
BIG-5 has the notion of a replacement character????
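For the encoding direction, I would expect a strict encode of U+FFFD to
fail rather than to yield a2ce. A sketch of how to check that for any
codec (the codec name "big5" is an assumption, and encode_or_none is a
made-up helper):

```python
def encode_or_none(ch, encoding):
    """Strict encode; None if the codec has no mapping for ch."""
    try:
        return ch.encode(encoding, "strict")
    except UnicodeError:
        return None


if __name__ == "__main__":
    # U+FFFD REPLACEMENT CHARACTER; no particular result is claimed here.
    result = encode_or_none("\ufffd", "big5")
    print(None if result is None else result.hex())
```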
> 3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
>
>         CJK       Japanese  GNU
> 00a5    -         5c        5c
That seems wrong, too.
> 203e    -         7e        7e
Likewise.
> Okay, the comparison says that we need some discussions on the mappings.
> I'll fix CJKCodecs' bugs as soon as possible and welcome any opinions
> about mapping inconsistencies. :)
I'm too lazy now to review the other encodings. I'd encourage you to
consult ICU for established procedures, and to document the cases
where you pick one of the possible alternatives. I do hope that this
set of codecs becomes part of standard Python one day, at which point
we really need to document what exactly they do.
Regards,
Martin