[I18n-sig] Re: CJKCodecs 0.9 is released
Hye-Shik Chang
perky@fallin.lv
Wed, 11 Jun 2003 17:13:01 +0900
On Wed, Jun 11, 2003 at 07:25:02AM +0200, Martin v. Löwis wrote:
> Hye-Shik Chang <perky@fallin.lv> writes:
>
> > I don't think CJKCodecs can replace Chinese and JapaneseCodecs immediately.
> > But CJKCodecs will remain useful for its ability to support
> > inter-CJK encodings like ISO-2022-JP-2 and ISO-2022-INT-1.
>
> This is an interesting summary. Can you produce another comparison,
> showing the differences in output of these codecs? Particularly
> interesting might be cp932, euc-jp, iso-2022-jp, big5, and gb2312.
> For these, please find out
> a) which characters are encoded in one codec that are not encoded
> in the other (i.e. Unicode code point -> encoding)
> b) which characters are decoded in one codec that are not decoded
> in the other (i.e. encoding -> Unicode code point)
> c) which characters are encoded differently
> d) which characters are decoded differently
>
Legend:
CJK - CJKCodecs 0.9
Chinese - ChineseCodecs 1.2.0
Japanese - JapaneseCodecs 1.4.9
Korean - KoreanCodecs 2.0.5
GNU - GNU libiconv 1.8 + iconvcodecs 1.0
1. DECODERS
1) CJKCodecs' gb2312 versus ChineseCodecs' euc-gb2312-cn
exactly identical, except that ChineseCodecs raises IndexError rather
than UnicodeError for incomplete multibyte sequences.
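The exception type matters because callers normally catch UnicodeError around a decode. A minimal sketch of how to check it, assuming a current Python 3 interpreter and its built-in gb2312 codec (not the 2003 packages compared here); the helper name decode_error_type is hypothetical:

```python
def decode_error_type(data, encoding):
    """Return the exception class raised while decoding data, or None."""
    try:
        data.decode(encoding)
    except Exception as exc:  # deliberately broad: we want to see the type
        return type(exc)
    return None

# 0xb0 is a GB2312 lead byte with no trail byte: an incomplete sequence.
err = decode_error_type(b"\xb0", "gb2312")
print(err, issubclass(err, UnicodeError))
```

A codec that leaked IndexError here would slip past an `except UnicodeError:` handler, which is why the difference is worth noting even though the mappings agree.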
2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
Bytes CJK Chinese GNU
a15a - fffd -
a1c3 - fffd -
a1c5 - fffd -
a1fe - fffd -
a240 - fffd -
a2cc - fffd -
a2ce - fffd -
Also, the chinesetw.big5 codec has the same problem as chinesecn.gb2312
(IndexError instead of UnicodeError for incomplete multibyte sequences).
3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
Bytes CJK Japanese GNU
a1c0 005c ff3c ff3c
f5a1 e000 - e000 -+ User-Defined Area
f5a2 e001 - e001 |
.... |
fefd e3aa - e3aa |
fefe e3ab - e3ab -+
ffa1 e3ac - - -+ CJKCodecs' bug ;)
ffa2 e3ad - - |
.... |
fffd e408 - - |
fffe e409 - - -+
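The user-defined rows in the table follow simple arithmetic: 10 rows (lead bytes f5..fe) of 94 cells (trail bytes a1..fe) laid out consecutively from U+E000. A small sketch of that arithmetic follows; the function name and the max_lead parameter are made up for illustration, and the suggestion that a missing upper bound on the lead byte explains the ffa1..fffe rows is only a guess that happens to reproduce the listed values:

```python
def eucjp_uda_to_pua(lead, trail, max_lead=0xFE):
    """Map a two-byte EUC-JP user-defined sequence to a code point, or None."""
    # Rows f5..fe, cells a1..fe: 94 cells per row, starting at U+E000.
    if not (0xF5 <= lead <= max_lead and 0xA1 <= trail <= 0xFE):
        return None
    return 0xE000 + (lead - 0xF5) * 94 + (trail - 0xA1)

# Rows taken from the table above.
for seq in (0xF5A1, 0xFEFD, 0xFEFE, 0xFFA1):
    cp = eucjp_uda_to_pua(seq >> 8, seq & 0xFF)
    print(f"{seq:04x}", "-" if cp is None else f"{cp:04x}")

# Relaxing the lead-byte bound yields the values listed as a bug.
print(f"{eucjp_uda_to_pua(0xFF, 0xFE, max_lead=0xFF):04x}")
```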
4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis
Bytes CJK Japanese GNU
005c 00a5 005c 00a5
007e 203e 007e 203e
007f - 007f 007f
815f 005c ff3c ff3c
817f - 00d7 -
837f - 30df -
....
9e7f - 684e -
9f7f - 6bef -
a040 - 6f3e -
a041 - 6f13 -
....
a0fb - 74d4 -
a0fc - 73f1 -
e07f - 70dd -
e17f - 75ff -
e27f - 7ab0 -
....
e97f - 9a43 -
ea7f - 9eef -
f040 e000 - e000 -+ User-Defined Area
f041 e001 - e001 |
.... |
f9fb e756 - e756 |
f9fc e757 - e757 -+
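The Shift-JIS user-defined area in the table works out the same way, except that each row has 188 cells (trail bytes 40..7e and 80..fc, skipping 7f), so 10 rows cover U+E000..U+E757. A sketch of that calculation, with a hypothetical function name and checked only against the rows shown above:

```python
def sjis_uda_to_pua(lead, trail):
    """Map a Shift-JIS user-defined sequence (f040..f9fc) to a code point."""
    if not 0xF0 <= lead <= 0xF9:
        return None
    if 0x40 <= trail <= 0x7E:
        cell = trail - 0x40
    elif 0x80 <= trail <= 0xFC:
        cell = trail - 0x41  # one less, because 0x7f is not a trail byte
    else:
        return None
    return 0xE000 + (lead - 0xF0) * 188 + cell

# Rows taken from the table above: (bytes, code point listed for CJK and GNU).
for seq, listed in ((0xF040, 0xE000), (0xF041, 0xE001),
                    (0xF9FB, 0xE756), (0xF9FC, 0xE757)):
    got = sjis_uda_to_pua(seq >> 8, seq & 0xFF)
    print(f"{seq:04x} -> {got:04x} (table: {listed:04x})")
```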
5) CJKCodecs' cp932 versus JapaneseCodecs' ms932
Bytes CJK Japanese GNU
00a1 - ff61 ff61 -+ CJKCodecs' bug ;)
00a2 - ff62 ff62 |
.... |
00de - ff9e ff9e |
00df - ff9f ff9f -+
8160 ff5e ff5e 301c
8161 2225 2225 2016
817c ff0d ff0d 2212
817f - 00d7 -
8191 ffe0 ffe0 00a2
8192 ffe1 ffe1 00a3
81ca ffe2 ffe2 00ac
837f - 30df -
847f - 043d -
....
a0fb - 74d4 -
a0fc - 73f1 -
e07f - 70dd -
e17f - 75ff -
....
e97f - 9a43 -
ea7f - 9eef -
6) CJKCodecs' euc-kr versus KoreanCodecs' euc-kr
exactly identical
7) CJKCodecs' cp949 versus KoreanCodecs' cp949
exactly identical
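For anyone who wants to reproduce decoder tables of this kind, here is a rough sketch of the approach: feed every one- and two-byte sequence to both decoders and record the ones where the results differ. It assumes Python 3 and uses the interpreter's built-in cp932 and shift_jis codecs purely as stand-ins, since the packages compared above may not be installable today. The function names are invented for this example, and three-byte sequences (such as the 8f-prefixed EUC-JP ones) are not probed.

```python
def try_decode(data, encoding):
    """Decode data to exactly one character, or return None."""
    try:
        text = data.decode(encoding)
    except UnicodeError:
        return None
    # Two bytes that decode as two separate characters are not one mapping.
    return text if len(text) == 1 else None

def one_and_two_byte_sequences():
    """Yield every single byte, then every pair with a high lead byte."""
    for b1 in range(0x100):
        yield bytes([b1])
    for b1 in range(0x80, 0x100):
        for b2 in range(0x40, 0x100):
            yield bytes([b1, b2])

def decoder_diff(enc_a, enc_b, sequences):
    """Return {bytes: (char_a, char_b)} for sequences decoded differently."""
    diff = {}
    for seq in sequences:
        a, b = try_decode(seq, enc_a), try_decode(seq, enc_b)
        if a != b:
            diff[seq] = (a, b)
    return diff

def fmt(ch):
    return "-" if ch is None else f"{ord(ch):04x}"

diff = decoder_diff("cp932", "shift_jis", one_and_two_byte_sequences())
for seq in sorted(diff)[:10]:
    a, b = diff[seq]
    print(seq.hex(), fmt(a), fmt(b))
```

A "-" in the output means the decoder rejected the sequence, matching the convention in the tables above.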
2. ENCODERS
1) CJKCodecs' gb2312 versus ChineseCodecs' euc-gb2312-cn
exactly identical
2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
Unicode CJK Chinese GNU
fffd a2ce a2ce -
3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
Unicode CJK Japanese GNU
00a5 - 5c 5c
203e - 7e 7e
e000 f5a1 - f5a1 -+ User-Defined Area
e001 f5a2 - f5a2 |
.... |
e3aa fefd - fefd |
e3ab fefe - fefe |
e3ac 8ff5a1 - 8ff5a1 |
e3ad 8ff5a2 - 8ff5a2 |
.... |
e756 8ffefd - 8ffefd |
e757 8ffefe - 8ffefe -+
ff3c - a1c0 a1c0
ff5e - - 8fa2b7
4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis
Unicode CJK Japanese GNU
005c 815f 5c -
007e - 7e -
007f - 7f 7f
e000 f040 - f040 -+ User-Defined Area
e001 f041 - f041 |
.... |
e756 f9fb - f9fb |
e757 f9fc - f9fc -+
ff3c - 815f 815f
5) CJKCodecs' cp932 versus JapaneseCodecs' ms932
Unicode CJK Japanese GNU
0080 - 80 -
00a1 - 21 - -+ latin-1 -> ascii
00a5 - 5c - | fallbacks.
.... |
00fe - 74 - |
00ff - 79 - -+
2116 8782 8782 fa59
2121 8784 8784 fa5a
....
2168 875c 875c fa52
2169 875d 875d fa53
2170 eeef fa40 fa40
2171 eef0 fa41 fa41
....
2178 eef7 fa48 fa48
2179 eef8 fa49 fa49
2225 8161 8161 -
3094 - 8394 -
3231 878a 878a fa58
4e28 ed4c fa68 fa68
4ee1 ed4d fa69 fa69
....
9e19 eeeb fc4a fc4a
9ed1 eeec fc4b fc4b
f8f0 - a0 -
f8f1 - fd -
f8f2 - fe -
f8f3 - ff -
f929 edc4 fae0 fae0
f9dc eecd fbe9 fbe9
....
ff02 eefc fa57 fa57
ff07 eefb fa56 fa56
ff0d 817c 817c -
ff5e 8160 8160 -
ffe0 8191 8191 -
ffe1 8192 8192 -
ffe2 81ca 81ca fa54
ffe4 eefa fa55 fa55
6) CJKCodecs' euc-kr versus KoreanCodecs' euc-kr
exactly identical
7) CJKCodecs' cp949 versus KoreanCodecs' cp949
exactly identical
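The encoder side can be checked in the opposite direction: encode every BMP code point with both codecs and keep the ones that disagree. This is only a sketch under the same assumptions as before (Python 3, built-in codecs as stand-ins for the packages compared here, invented function names):

```python
def try_encode(ch, encoding):
    """Encode one character, or return None if the codec rejects it."""
    try:
        return ch.encode(encoding)
    except UnicodeError:
        return None

def encoder_diff(enc_a, enc_b, code_points=range(0x10000)):
    """Return {code_point: (bytes_a, bytes_b)} where the encoders disagree."""
    diff = {}
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # lone surrogates are not encodable characters
        ch = chr(cp)
        a, b = try_encode(ch, enc_a), try_encode(ch, enc_b)
        if a != b:
            diff[cp] = (a, b)
    return diff

def fmt_bytes(data):
    return "-" if data is None else data.hex()

enc_diff = encoder_diff("cp932", "shift_jis")
for cp in sorted(enc_diff)[:10]:
    a, b = enc_diff[cp]
    print(f"{cp:04x}", fmt_bytes(a), fmt_bytes(b))
```

Running both directions separately is what distinguishes cases c) and d) in Martin's list from a) and b): a code point can round-trip in one codec yet be rejected, or mapped elsewhere, by the other.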
Okay, the comparison shows that we need some discussion of the mappings.
I'll fix CJKCodecs' bugs as soon as possible, and I welcome any opinions
about the mapping inconsistencies. :)
Regards,
Hye-Shik =)