[I18n-sig] Re: CJKCodecs 0.9 is released

Hye-Shik Chang perky@fallin.lv
Wed, 11 Jun 2003 17:13:01 +0900


On Wed, Jun 11, 2003 at 07:25:02AM +0200, Martin v. Löwis wrote:
> Hye-Shik Chang <perky@fallin.lv> writes:
> 
> > I don't think CJKCodecs can replace Chinese and JapaneseCodecs immediately.
> > But CJKCodecs will remain useful for its ability to support
> > inter-CJK encodings like ISO-2022-JP-2 and ISO-2022-INT-1.
> 
> This is an interesting summary. Can you produce another comparison,
> showing the differences in output of these codecs? Particularly
> interesting might be cp932, euc-jp, iso-2022-jp, big5, and gb2312.
> For these, please find out
> a) which characters are encoded in one codec that are not encoded
>    in the other (i.e. Unicode code point -> encoding)
> b) which characters are decoded in one codec that are not decoded
>    in the other (i.e. encoding -> Unicode code point)
> c) which characters are encoded differently
> d) which characters are decoded differently
> 

Legend:
    CJK - CJKCodecs 0.9
    Chinese - ChineseCodecs 1.2.0
    Japanese - JapaneseCodecs 1.4.9
    Korean - KoreanCodecs 2.0.5
    GNU - GNU libiconv 1.8 + iconvcodecs 1.0
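
A comparison of this kind can be produced by feeding the same inputs to
two codecs and recording where the results differ. The following is a
minimal sketch of that idea, not the script used for the tables below.
It is written for a current Python 3 rather than the Python 2.x of the
time, all function names are hypothetical, and latin-1, cp1252 and ascii
are used only as stand-in codecs that ship with every Python.

```python
def try_decode(codec, raw):
    """Decode raw bytes with codec; return None if the codec rejects them."""
    try:
        return raw.decode(codec)
    except UnicodeError:
        return None


def try_encode(codec, text):
    """Encode text with codec; return None if the codec rejects it."""
    try:
        return text.encode(codec)
    except UnicodeError:
        return None


def diff_codecs(convert, codec_a, codec_b, samples):
    """Return (sample, result_a, result_b) for each sample the codecs disagree on."""
    rows = []
    for sample in samples:
        a = convert(codec_a, sample)
        b = convert(codec_b, sample)
        if a != b:
            rows.append((sample, a, b))
    return rows


def two_byte_samples(leads, trails):
    """Enumerate candidate double-byte sequences from lead and trail byte values."""
    return [bytes([lead, trail]) for lead in leads for trail in trails]


# Decoder comparison (questions b and d), with stand-in codecs.
print(diff_codecs(try_decode, "latin-1", "cp1252", [b"A", b"\x80"]))
# Encoder comparison (questions a and c), with stand-in codecs.
print(diff_codecs(try_encode, "latin-1", "ascii", ["A", "\xe9"]))
```

A None on one side corresponds to a "-" entry in the tables below, i.e.
the codec does not map that input at all.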

1. DECODERS

   1) CJKCodecs' gb2312 versus ChineseCodecs' euc-gb2312-cn

      Exactly identical, except that ChineseCodecs raises IndexError
      rather than UnicodeError for incomplete multibyte sequences.
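
      The exception type matters to callers that catch UnicodeError
      around a decode. A small sketch of how the behaviour can be
      probed, assuming a current Python 3 and its built-in gb2312
      codec (not the 0.9 release compared here); classify_decode is a
      hypothetical helper:

```python
def classify_decode(raw, codec):
    """Return ('ok', text), or ('error', kind) naming the exception family."""
    try:
        return "ok", raw.decode(codec)
    except Exception as exc:
        # Anything that is not a UnicodeError (for example IndexError)
        # would slip past a caller's "except UnicodeError" handler.
        if isinstance(exc, UnicodeError):
            return "error", "UnicodeError"
        return "error", type(exc).__name__


print(classify_decode(b"\xb0\xa1", "gb2312"))  # complete two-byte sequence
print(classify_decode(b"\xb0", "gb2312"))      # lead byte only, trail byte missing
```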

   2) CJKCodecs' big5 versus ChineseCodecs' big5-tw

                CJK             Chinese         GNU
        a15a    -               fffd            -
        a1c3    -               fffd            -
        a1c5    -               fffd            -
        a1fe    -               fffd            -
        a240    -               fffd            -
        a2cc    -               fffd            -
        a2ce    -               fffd            -

      Also, the chinesetw.big5 codec has the same problem as
      chinesecn.gb2312 (IndexError on incomplete multibyte sequences).

   3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp

                CJK             Japanese        GNU
        a1c0    005c            ff3c            ff3c
        f5a1    e000            -               e000    -+ User-Defined Area
        f5a2    e001            -               e001     |
            ....                                         |
        fefd    e3aa            -               e3aa     |
        fefe    e3ab            -               e3ab    -+
        ffa1    e3ac            -               -       -+ CJKCodecs' bug ;)
        ffa2    e3ad            -               -        |
            ....                                         |
        fffd    e408            -               -        |
        fffe    e409            -               -       -+
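
The visible rows of the user-defined area are consistent with a simple
linear layout: 94 cells per row, starting at U+E000 for f5a1. A minimal
sketch of that arithmetic follows. The function names are hypothetical,
the elided rows are assumed to follow the same layout, and reading the
ffa1..fffe rows as a missing upper bound on the lead byte is a guess
from the numbers, not something checked against the CJKCodecs source.

```python
UDA_BASE = 0xE000    # first private-use code point in the table
CELLS_PER_ROW = 94   # trail bytes 0xa1..0xfe


def uda_unchecked(lead, trail):
    """Linear mapping with no upper bound on the lead byte (hypothetical)."""
    return UDA_BASE + (lead - 0xF5) * CELLS_PER_ROW + (trail - 0xA1)


def uda_checked(lead, trail):
    """Same mapping, restricted to rows 0xf5..0xfe; None outside them."""
    if 0xF5 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE:
        return uda_unchecked(lead, trail)
    return None


for lead, trail in [(0xF5, 0xA1), (0xFE, 0xFE), (0xFF, 0xA1), (0xFF, 0xFE)]:
    checked = uda_checked(lead, trail)
    print("%02x%02x  unchecked=%04x  checked=%s" % (
        lead, trail, uda_unchecked(lead, trail),
        "-" if checked is None else "%04x" % checked))
```

Ten rows of 94 cells give 940 = 0x3ac code points, which is also where
the three-byte 8f-prefixed sequences pick up in the encoder table in
section 2 (e3ac -> 8ff5a1).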


   4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis

                CJK             Japanese        GNU
        005c    00a5            005c            00a5
        007e    203e            007e            203e
        007f    -               007f            007f
        815f    005c            ff3c            ff3c
        817f    -               00d7            -
        837f    -               30df            -
            ....
        9e7f    -               684e            -
        9f7f    -               6bef            -
        a040    -               6f3e            -
        a041    -               6f13            -
            ....
        a0fb    -               74d4            -
        a0fc    -               73f1            -
        e07f    -               70dd            -
        e17f    -               75ff            -
        e27f    -               7ab0            -
            ....
        e97f    -               9a43            -
        ea7f    -               9eef            -
        f040    e000            -               e000    -+ User-Defined Area
        f041    e001            -               e001     |
            ....                                         |
        f9fb    e756            -               e756     |
        f9fc    e757            -               e757    -+
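
The four visible user-defined rows here fit a linear layout as well:
lead bytes f0..f9, with 188 cells per lead byte (trail bytes 40..7e and
80..fc, skipping 7f). The sketch below reproduces those endpoints; the
function name is hypothetical and the elided rows are assumed to follow
the same pattern.

```python
UDA_BASE = 0xE000
CELLS_PER_LEAD = 188  # 63 cells for 0x40..0x7e plus 125 cells for 0x80..0xfc


def sjis_uda_to_unicode(lead, trail):
    """Map a Shift-JIS user-defined byte pair to a code point, or None."""
    if not 0xF0 <= lead <= 0xF9:
        return None
    if 0x40 <= trail <= 0x7E:
        cell = trail - 0x40
    elif 0x80 <= trail <= 0xFC:
        cell = trail - 0x41   # one less, because 0x7f is not a trail byte
    else:
        return None
    return UDA_BASE + (lead - 0xF0) * CELLS_PER_LEAD + cell


for lead, trail in [(0xF0, 0x40), (0xF0, 0x41), (0xF9, 0xFB), (0xF9, 0xFC)]:
    print("%02x%02x  %04x" % (lead, trail, sjis_uda_to_unicode(lead, trail)))
```

Ten lead bytes of 188 cells cover U+E000..U+E757, the same range the
EUC-JP encoder table in section 2 spans with its two- and three-byte
forms together.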

    5) CJKCodecs' cp932 versus JapaneseCodecs' ms932

                CJK             Japanese        GNU
        00a1    -               ff61            ff61    -+ CJKCodecs' bug ;)
        00a2    -               ff62            ff62     |
            ....                                         |
        00de    -               ff9e            ff9e     |
        00df    -               ff9f            ff9f    -+
        8160    ff5e            ff5e            301c
        8161    2225            2225            2016
        817c    ff0d            ff0d            2212
        817f    -               00d7            -
        8191    ffe0            ffe0            00a2
        8192    ffe1            ffe1            00a3
        81ca    ffe2            ffe2            00ac
        837f    -               30df            -
        847f    -               043d            -
            ....
        a0fb    -               74d4            -
        a0fc    -               73f1            -
        e07f    -               70dd            -
        e17f    -               75ff            -
            ....
        e97f    -               9a43            -
        ea7f    -               9eef            -
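
For 8160, 8161, 817c, 8191, 8192 and 81ca the CJK and Japanese columns
give the fullwidth forms used by the Microsoft mapping, while the GNU
column gives the JIS X 0208-style code points. The same split can be
seen in a current Python 3 between its built-in cp932 and shift_jis
codecs. This is only an illustration under that assumption: those
codecs are not the 0.9 release compared here, shift_jis is merely a
stand-in that agrees with the GNU column on these six rows, and
decode_hex is a hypothetical helper.

```python
def decode_hex(raw, codec):
    """Decode one character and return its code point as hex, or '-' on failure."""
    try:
        return "%04x" % ord(raw.decode(codec))
    except UnicodeError:
        return "-"


SAMPLES = [b"\x81\x60", b"\x81\x61", b"\x81\x7c",
           b"\x81\x91", b"\x81\x92", b"\x81\xca"]

print("        cp932   shift_jis")
for raw in SAMPLES:
    print("%s    %s    %s" % (raw.hex(), decode_hex(raw, "cp932"),
                              decode_hex(raw, "shift_jis")))
```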

    6) CJKCodecs' euc-kr versus KoreanCodecs' euc-kr

        exactly identical

    7) CJKCodecs' cp949 versus KoreanCodecs' cp949

        exactly identical



2. ENCODERS

    1) CJKCodecs' gb2312 versus ChineseCodecs' euc-gb2312-cn

       exactly identical

    2) CJKCodecs' big5 versus ChineseCodecs' big5-tw

                CJK             Chinese         GNU
        fffd    a2ce            a2ce            -

    3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp

                CJK             Japanese        GNU
        00a5    -               5c              5c
        203e    -               7e              7e
        e000    f5a1            -               f5a1    -+ User-Defined Area
        e001    f5a2            -               f5a2     |
            ....                                         |
        e3aa    fefd            -               fefd     |
        e3ab    fefe            -               fefe     |
        e3ac    8ff5a1          -               8ff5a1   |
        e3ad    8ff5a2          -               8ff5a2   |
            ....                                         |
        e756    8ffefd          -               8ffefd   |
        e757    8ffefe          -               8ffefe  -+
        ff3c    -               a1c0            a1c0
        ff5e    -               -               8fa2b7

    4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis

                CJK             Japanese        GNU
        005c    815f            5c              -
        007e    -               7e              -
        007f    -               7f              7f
        e000    f040            -               f040    -+ User-Defined Area
        e001    f041            -               f041     |
            ....                                         |
        e756    f9fb            -               f9fb     |
        e757    f9fc            -               f9fc    -+
        ff3c    -               815f            815f

    5) CJKCodecs' cp932 versus JapaneseCodecs' ms932

                CJK             Japanese        GNU
        0080    -               80              -
        00a1    -               21              -       -+ latin-1 -> ascii
        00a5    -               5c              -        | fallbacks.
            ....                                         |
        00fe    -               74              -        |
        00ff    -               79              -       -+
        2116    8782            8782            fa59
        2121    8784            8784            fa5a
            ....
        2168    875c            875c            fa52
        2169    875d            875d            fa53
        2170    eeef            fa40            fa40
        2171    eef0            fa41            fa41
            ....
        2178    eef7            fa48            fa48
        2179    eef8            fa49            fa49
        2225    8161            8161            -
        3094    -               8394            -
        3231    878a            878a            fa58
        4e28    ed4c            fa68            fa68
        4ee1    ed4d            fa69            fa69
            ....
        9e19    eeeb            fc4a            fc4a
        9ed1    eeec            fc4b            fc4b
        f8f0    -               a0              -
        f8f1    -               fd              -
        f8f2    -               fe              -
        f8f3    -               ff              -
        f929    edc4            fae0            fae0
        f9dc    eecd            fbe9            fbe9
            ....
        ff02    eefc            fa57            fa57
        ff07    eefb            fa56            fa56
        ff0d    817c            817c            -
        ff5e    8160            8160            -
        ffe0    8191            8191            -
        ffe1    8192            8192            -
        ffe2    81ca            81ca            fa54
        ffe4    eefa            fa55            fa55

    6) CJKCodecs' euc-kr versus KoreanCodecs' euc-kr

        exactly identical

    7) CJKCodecs' cp949 versus KoreanCodecs' cp949

        exactly identical



Okay, the comparison shows that the mappings need some discussion.
I'll fix CJKCodecs' bugs as soon as possible, and I welcome any opinions
about the mapping inconsistencies. :)


Regards,
    Hye-Shik =)