[I18n-sig] Re: CJKCodecs 0.9 is released

Martin v. Löwis martin@v.loewis.de
11 Jun 2003 22:34:15 +0200


Hye-Shik Chang <perky@fallin.lv> writes:

> Legend:
>     CJK - CJKCodecs 0.9
>     Chinese - ChineseCodecs 1.2.0
>     Japanese - JapaneseCodecs 1.4.9
>     Korean - KoreanCodecs 2.0.5
>     GNU - GNU libiconv 1.8 + iconvcodecs 1.0

Very interesting, again. I have some problems interpreting the data,
though.

>    2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
> 
>                 CJK             Chinese         GNU
>         a15a    -               fffd            -
>         a1c3    -               fffd            -
>         a1c5    -               fffd            -
>         a1fe    -               fffd            -
>         a240    -               fffd            -
>         a2cc    -               fffd            -
>         a2ce    -               fffd            -

What does that mean? CJK and iconv give UnicodeError, whereas
ChineseCodecs puts in the replacement character? That seems like a bug
in ChineseCodecs to me. The replacement character should only be
generated if errors='replace', no?
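
To make the expected behaviour concrete, here is a minimal sketch. It
uses Python's built-in ascii codec purely as a stand-in for any codec
that has no mapping for a given byte; the input bytes are illustrative
and not taken from the table above.

```python
# Illustration only: the ascii codec stands in for any codec that
# has no mapping for a given byte.
raw = b"ab\x80"

# The default error handling is 'strict': an unmapped byte raises.
try:
    raw.decode("ascii")
    strict_raised = False
except UnicodeError:
    strict_raised = True

# U+FFFD should appear only when the caller explicitly asks for it.
replaced = raw.decode("ascii", errors="replace")

print(strict_raised, repr(replaced))
```

A codec that emits U+FFFD under the default 'strict' handling would
hide malformed input from the caller.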

>    3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
> 
>                 CJK             Japanese        GNU
>         01c0    005c            ff3c            ff3c

That appears to be a bug in CJK, right? The question is whether
\xa1\xc0 is REVERSE SOLIDUS or FULLWIDTH REVERSE SOLIDUS.  Now, it
appears that euc-jp also supports \x5c, mapped to REVERSE SOLIDUS,
and that \xa1\xc0 should be interpreted as FULLWIDTH REVERSE SOLIDUS,
no?
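
One way to check what a given installation actually does is to decode
both byte sequences and print the Unicode names. This is only a
sketch: the helper name describe() is made up for illustration, and
the output depends on which euc_jp codec and Python version are
installed, so no expected result is shown.

```python
import unicodedata

def describe(raw, codec):
    """Return (hex code point, Unicode name) for a byte string that
    decodes to exactly one character, or None if the codec rejects it."""
    try:
        text = raw.decode(codec)
    except UnicodeDecodeError:
        return None
    if len(text) != 1:
        return None
    return "%04x" % ord(text), unicodedata.name(text, "<unnamed>")

# The two byte sequences under discussion.
for raw in (b"\x5c", b"\xa1\xc0"):
    print(raw, describe(raw, "euc_jp"))
```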

In case of doubt, I think ICU should be consulted for reference as
well, and some kind of majority should be followed. In any case, I
think the questionable mappings need to be documented.
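
A rough sketch of what "following some kind of majority" could mean in
practice. The function name majority_mapping() and the strict-majority
rule are assumptions made for illustration; the sample row is the
euc-jp reverse-solidus row quoted above, and an ICU column would simply
be one more entry in the dictionary.

```python
from collections import Counter

def majority_mapping(candidates):
    """candidates maps an implementation name to the code point it
    produces (None if unmapped). Returns the value chosen by a strict
    majority of implementations, or None if there is no majority."""
    votes = Counter(candidates.values())
    value, count = votes.most_common(1)[0]
    if count * 2 > len(candidates):
        return value
    return None

# The euc-jp reverse-solidus row from the comparison above.
row = {"CJK": 0x005C, "Japanese": 0xFF3C, "GNU": 0xFF3C}
winner = majority_mapping(row)
print(None if winner is None else "%04x" % winner)
```

Rows where no majority exists are exactly the ones that would need an
explicit, documented decision.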

>         f5a1    e000            -               e000    -+ User-Defined Area
>         f5a2    e001            -               e001     |
>             ....                                         |
>         fefd    e3aa            -               e3aa     |
>         fefe    e3ab            -               e3ab    -+

What are these? I cannot find them in glibc.
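
Going only by the "User-Defined Area" label and the four quoted rows,
this looks like a linear mapping of euc-jp rows f5..fe onto the
Unicode Private Use Area, which starts at U+E000. The arithmetic below
is inferred from those rows, not taken from any codec's source, and
the function name is made up.

```python
def eucjp_uda_to_pua(lead, trail):
    """Map a two-byte euc-jp code in rows 0xf5..0xfe to a Private Use
    code point, using a formula inferred from the quoted rows."""
    if not (0xF5 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
        raise ValueError("outside the user-defined rows")
    # 94 cells per row, rows and cells both counted from zero.
    return 0xE000 + (lead - 0xF5) * 94 + (trail - 0xA1)

for lead, trail in ((0xF5, 0xA1), (0xF5, 0xA2), (0xFE, 0xFD), (0xFE, 0xFE)):
    print("%02x%02x -> %04x" % (lead, trail, eucjp_uda_to_pua(lead, trail)))
```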

>    4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis
> 
>                 CJK             Japanese        GNU
>         005c    00a5            005c            00a5

Here, I would trust GNU iconv; 5C really is YEN SIGN.

>         007e    203e            007e            203e

Likewise for OVERLINE - is there really no TILDE in shift-jis?

>         007f    -               007f            007f

Why that?

>         815f    005c            ff3c            ff3c

Again: Why that? Shouldn't \x81\x5f be FULLWIDTH REVERSE SOLIDUS?

>         817f    -               00d7            -

What character is that? Why does JapaneseCodecs map it to
MULTIPLICATION SIGN? glibc seems to map MULTIPLICATION SIGN to
\x81\x7e; is that a typo in JapaneseCodecs?

>         837f    -               30df            -

Are you sure glibc does not support that? Seems to be
KATAKANA LETTER MI.


>         9e7f    -               684e            -

Why does JapaneseCodecs do that? glibc maps 9e7e to 684e.
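
The three xx7f rows have one thing in common: 0x7f is not a valid
Shift_JIS trail byte (trail bytes run 0x40..0x7e and 0x80..0xfc), which
would be consistent with CJK and GNU rejecting them. A small check,
with a made-up helper name:

```python
def is_sjis_trail(byte):
    """True if byte is a valid second byte of a two-byte Shift_JIS code."""
    return 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC

# 817e for comparison, then the three disputed codes.
for code in (0x817E, 0x817F, 0x837F, 0x9E7F):
    print("%04x" % code, is_sjis_trail(code & 0xFF))
```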

>         f040    e000            -               e000    -+ User-Defined Area
>         f041    e001            -               e001     |
>             ....                                         |
>         f9fb    e756            -               e756     |
>         f9fc    e757            -               e757    -+

Again: What are these characters?
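
As with euc-jp, the quoted rows fit a linear mapping onto the Private
Use Area starting at U+E000, this time with 188 trail bytes per lead
byte. The formula is inferred from the four quoted rows only, and the
function name is made up.

```python
def sjis_uda_to_pua(lead, trail):
    """Map a Shift_JIS code with lead byte 0xf0..0xf9 to a Private Use
    code point, using a formula inferred from the quoted rows."""
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError("lead byte outside the user-defined area")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
        raise ValueError("invalid trail byte")
    offset = trail - 0x40
    if trail >= 0x80:
        offset -= 1  # 0x7f is skipped in the trail-byte sequence
    return 0xE000 + (lead - 0xF0) * 188 + offset

for lead, trail in ((0xF0, 0x40), (0xF0, 0x41), (0xF9, 0xFB), (0xF9, 0xFC)):
    print("%02x%02x -> %04x" % (lead, trail, sjis_uda_to_pua(lead, trail)))
```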

>     5) CJKCodecs' cp932 versus JapaneseCodecs' ms932
> 
>                 CJK             Japanese        GNU
>         8160    ff5e            ff5e            301c
>         8161    2225            2225            2016

Are these verified against MS CP932, e.g. from Windows XP?
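
For reference, the Unicode names of the competing code points in those
two rows can be listed with the standard unicodedata module. The
dictionary below just restates the quoted table; it says nothing about
which column is right.

```python
import unicodedata

# cp932 byte sequence -> (CJK/Japanese value, GNU value), from the table.
rows = {
    0x8160: (0xFF5E, 0x301C),
    0x8161: (0x2225, 0x2016),
}

named = {
    code: tuple(unicodedata.name(chr(cp)) for cp in pair)
    for code, pair in rows.items()
}

for code, (left, right) in sorted(named.items()):
    print("%04x: %s  vs  %s" % (code, left, right))
```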

>         817f    -               00d7            -

Likewise: For CP932, it seems essential to do whatever Microsoft does,
in any Windows version.

>     2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
> 
>                 CJK             Chinese         GNU
>         fffd    a2ce            a2ce            -

BIG-5 has the notion of a replacement character????

>     3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
> 
>                 CJK             Japanese        GNU
>         00a5    -               5c              5c

That seems wrong, too.

>         203e    -               7e              7e

Likewise.

> Okay, the comparison says that we need some discussions on the mappings.
> I'll fix CJKCodecs' bugs as soon as possible and welcome any opinions
> about mapping inconsistencies. :)

I'm too lazy now to review the other encodings. I'd encourage you to
consult ICU for established procedures, and to document the cases
where you pick one of the possible alternatives. I do hope that this
set of codecs becomes part of standard Python one day, at which point
we really need to document what exactly they do.

Regards,
Martin