[I18n-sig] Re: CJKCodecs 0.9 is released

Hye-Shik Chang perky@i18n.org
Fri, 20 Jun 2003 05:40:31 +0900


On Wed, Jun 11, 2003 at 10:34:15PM +0200, Martin v. Löwis wrote:
> Hye-Shik Chang <perky@fallin.lv> writes:
[snip]
> >    2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
> > 
> >                 CJK             Chinese         GNU
> >         a15a    -               fffd            -
> >         a1c3    -               fffd            -
> >         a1c5    -               fffd            -
> >         a1fe    -               fffd            -
> >         a240    -               fffd            -
> >         a2cc    -               fffd            -
> >         a2ce    -               fffd            -
> 
> What does that mean? CJK and iconv gives UnicodeError, whereas
> ChineseCodecs puts in the replacement character? Seems like a bug in
> ChineseCodecs to me, doesn't it? The replacement character should only
> be generated if errors='replace', no?

According to Unicode.org's mapping:
] A number of characters are not currently mapped because
]         of conflicts with other mappings.  They are as follows:
] 
] BIG5        Description                    Comments
] 
] 0xA15A      SPACING UNDERSCORE             duplicates A1C4
] 0xA1C3      SPACING HEAVY OVERSCORE        not in Unicode
] 0xA1C5      SPACING HEAVY UNDERSCORE       not in Unicode
] 0xA1FE      LT DIAG UP RIGHT TO LOW LEFT   duplicates A2AC
] 0xA240      LT DIAG UP LEFT TO LOW RIGHT   duplicates A2AD
] 0xA2CC      HANGZHOU NUMERAL TEN           conflicts with A451 mapping
] 0xA2CE      HANGZHOU NUMERAL THIRTY        conflicts with A4CA mapping
] 
] We currently map all of these characters to U+FFFD REPLACEMENT CHARACTER.
]         It is also possible to map these characters to their duplicates, or to
]         the user zone.

So, I changed the mapping for them to match what cp950 does, instead of
using U+FFFD or the user-defined area. I think that's acceptable.

BIG5        Unicode     Description

0xA15A      0x2574      SPACING UNDERSCORE
0xA1C3      0xFFE3      SPACING HEAVY OVERSCORE
0xA1C5      0x02CD      SPACING HEAVY UNDERSCORE
0xA1FE      0xFF0F      LT DIAG UP RIGHT TO LOW LEFT
0xA240      0xFF3C      LT DIAG UP LEFT TO LOW RIGHT
0xA2CC      0x5341      HANGZHOU NUMERAL TEN
0xA2CE      0x5345      HANGZHOU NUMERAL THIRTY
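
As a minimal sketch of how such overrides could sit on top of a base
table: the seven entries below are copied from the table above, while
decode_big5_pair and base_table are hypothetical names used only for
illustration, not CJKCodecs' actual internals.

```python
# Overrides copied from the table above: BIG5 code -> Unicode code point.
BIG5_CP950_STYLE_OVERRIDES = {
    0xA15A: 0x2574,  # SPACING UNDERSCORE
    0xA1C3: 0xFFE3,  # SPACING HEAVY OVERSCORE
    0xA1C5: 0x02CD,  # SPACING HEAVY UNDERSCORE
    0xA1FE: 0xFF0F,  # LT DIAG UP RIGHT TO LOW LEFT
    0xA240: 0xFF3C,  # LT DIAG UP LEFT TO LOW RIGHT
    0xA2CC: 0x5341,  # HANGZHOU NUMERAL TEN
    0xA2CE: 0x5345,  # HANGZHOU NUMERAL THIRTY
}


def decode_big5_pair(code, base_table):
    """Decode one two-byte BIG5 code (hypothetical helper).

    base_table is an assumed dict of the ordinary BIG5 mappings; the
    overrides are consulted first, and an unmapped code raises
    UnicodeError instead of silently yielding U+FFFD.
    """
    if code in BIG5_CP950_STYLE_OVERRIDES:
        return chr(BIG5_CP950_STYLE_OVERRIDES[code])
    try:
        return chr(base_table[code])
    except KeyError:
        raise UnicodeError("unmapped BIG5 code 0x%04X" % code) from None
```

With this arrangement U+FFFD would only appear when the caller asks
for errors='replace', which is the behaviour Martin describes.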

> 
> >    3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
> > 
> >                 CJK             Japanese        GNU
> >         a1c0    005c            ff3c            ff3c
> 
> That appears to be a bug in CJK, right? This is the question whether
> /xa1/xc0 is REVERSE SOLIDUS or FULLWIDTH REVERSE SOLIDUS.  Now, it
> appears that euc-jp also supports /x5c, mapped to REVERSE SOLIDUS,
> and that /xa1/xc0 should be interpreted as FULLWIDTH REVERSE SOLIDUS,
> no?

Right. That makes sense.

> 
> In case of doubt, I think ICU should be consulted for reference, as
> well, and following some kind of majority. In any case, I think the
> questionable mappings need to be documented.
> 
> >         f5a1    e000            -               e000    -+ User-Defined Area
> >         f5a2    e001            -               e001     |
> >             ....                                         |
> >         fefd    e3aa            -               e3aa     |
> >         fefe    e3ab            -               e3ab    -+
> 
> What are these? I cannot find them in glibc.

Quoting Ken Lunde's CJKV Information Processing p.206 table 4-66:
] Table 4-66: Shift-JIS to Unicode and EUC-JP for User-Defined Region
]
] Shift-JIS     Unicode     EUC-JP
] F040-F0FC     E000-E0BB   F5A1-F5FE, F6A1-F6FE
] F140-F1FC     E0BC-E177   F7A1-F7FE, F8A1-F8FE
] F240-F2FC     E178-E233   F9A1-F9FE, FAA1-FAFE
] --snip--
] F940-F9FC     E69C-E757   8FFDA1-8FFDFE, 8FFEA1-8FFEFE
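
The two-byte EUC-JP part of that table is linear, so it can be
computed rather than stored. The sketch below assumes 94 cells per
row, starting at F5A1 -> U+E000, which agrees with the rows quoted
above and with the f5a1..fefe endpoints in the comparison table. The
function name is made up for illustration, and the three-byte 8Fxxxx
range is deliberately left out.

```python
def eucjp_user_defined_to_unicode(lead, trail):
    """Map a two-byte EUC-JP user-defined code (F5A1-FEFE) to the PUA.

    Assumes the linear layout described in the lead-in:
    94 code points per row, first row F5, first cell A1.
    """
    if not (0xF5 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
        raise ValueError("not in the two-byte user-defined range")
    return 0xE000 + (lead - 0xF5) * 94 + (trail - 0xA1)


# Spot checks against the figures quoted in this thread.
for lead, trail in [(0xF5, 0xA1), (0xF6, 0xFE), (0xF7, 0xA1), (0xFE, 0xFE)]:
    code_point = eucjp_user_defined_to_unicode(lead, trail)
    print("%02x%02x -> %04x" % (lead, trail, code_point))
```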

> 
> >    4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis
> > 
> >                 CJK             Japanese        GNU
> >         005c    00a5            005c            00a5
> 
> Here, I would trust GNU iconv; 5C really is YEN SIGN.
> 
> >         007e    203e            007e            203e
> 
> Likewise for OVERLINE - is there really no TILDE in shift-jis?

:)

> 
> >         007f    -               007f            007f
> 
> Why that?

That's a bug in CJK. Fixed.

> 
> >         815f    005c            ff3c            ff3c
> 
> Again: Why that? Shouldn't /x81/x5f be FULLWIDTH REVERSE SOLIDUS?

Then shift-jis would lose its *reverse solidus*. Even Unicode.org's
mapping does this:
] sjis   jisx0208 unicode
] 0x815F  0x2140  0x005C  # REVERSE SOLIDUS



> 
> >         817f    -               00d7            -
> 
> What character is that? Why does JapaneseCodecs map it to
> MULTIPLICATION SIGN? glibc seems to map that to /x81/x7e;
> is that a typo in JapaneseCodecs?
> 
> >         837f    -               30df            -
> 
> Are you sure glibc does not support that? Seems to be
> KATAKANA LETTER MI.
> 
> 
> >         9e7f    -               684e            -
> 
> Why does JapaneseCodecs do that? glibc maps 9e7e to 684e.

They are not in shift-jis's byte range: 0x7f is never a valid trail
byte. I guess that JapaneseCodecs' SJIS->EUC macro has a bug around them.
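
For illustration, the trail-byte rule can be written as a one-line
check. The ranges 0x40-0x7E and 0x80-0xFC are the standard Shift-JIS
trail-byte ranges; the function name is only a placeholder.

```python
def is_valid_sjis_trail(byte):
    """Return True if byte can be the second byte of a Shift-JIS pair.

    Valid trail bytes are 0x40-0x7E and 0x80-0xFC; 0x7F is skipped.
    """
    return 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC


# The codes questioned above all end in 0x7f, so none of them passes.
for code in (0x817F, 0x837F, 0x9E7F):
    print("%04x" % code, is_valid_sjis_trail(code & 0xFF))
```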

> 
> >     5) CJKCodecs' cp932 versus JapaneseCodecs' ms932
> > 
> >                 CJK             Japanese        GNU
> >         8160    ff5e            ff5e            301c
> >         8161    2225            2225            2016
> 
> Are these verified against MS CP932, e.g. from Windows XP?
> 
> >         817f    -               00d7            -
> 
> Likewise: For CP932, it seems essential to do whatever Microsoft does,
> in any Windows version.

Okay. Here it is! :)

        CJK    Japanese GNU     WindowsXP
0080    -       -       -       0080
00a0    -       -       -       f8f0
00fd    -       -       -       f8f1
00fe    -       -       -       f8f2
00ff    -       -       -       f8f3
8160    ff5e    ff5e    301c    ff5e
8161    2225    2225    2016    2225
817c    ff0d    ff0d    2212    ff0d
817f    -       00d7    -       -
8191    ffe0    ffe0    00a2    ffe0
8192    ffe1    ffe1    00a3    ffe1
81ca    ffe2    ffe2    00ac    ffe2
837f    -       30df    -       -
847f    -       043d    -       -
    ....
e97f    -       9a43    -       -
ea7f    -       9eef    -       -

I'll add 0x80, 0xa0, 0xfd, 0xfe, 0xff to CJKCodecs's cp932 to conform
to Windows's real mapping.
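
A comparison like the one above can be approximated for any two
codecs registered in a Python interpreter. This is only a rough
sketch, not the script used to produce these tables: the helper names
are invented, a byte sequence counts as mapped only when it decodes to
exactly one character, and the candidate generator simply tries every
single byte plus every pair with lead byte 0x81-0xFF and trail byte
0x40-0xFF.

```python
def decode_or_none(codec, raw):
    """Return the code point raw decodes to, or None if unmapped."""
    try:
        text = raw.decode(codec)
    except UnicodeDecodeError:
        return None
    # Sequences that decode to several characters are not one mapping.
    return ord(text) if len(text) == 1 else None


def candidate_sequences():
    """Yield every single byte, then two-byte lead/trail combinations."""
    for byte in range(0x100):
        yield bytes([byte])
    for lead in range(0x81, 0x100):
        for trail in range(0x40, 0x100):
            yield bytes([lead, trail])


def compare_codecs(codec_a, codec_b, candidates):
    """Return {raw: (a, b)} for each sequence the codecs disagree on."""
    diffs = {}
    for raw in candidates:
        a = decode_or_none(codec_a, raw)
        b = decode_or_none(codec_b, raw)
        if a != b:
            diffs[raw] = (a, b)
    return diffs


def format_row(raw, pair):
    """Render one row in the style of the tables in this thread."""
    cells = ["-" if cp is None else "%04x" % cp for cp in pair]
    return "%s    %s" % (raw.hex(), "    ".join(cells))
```

For example, `compare_codecs("cp932", "shift_jis", candidate_sequences())`
would list the sequences on which the interpreter's own two codecs
differ; the result depends on the Python version, so no output is
shown here.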

> 
> >     2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
> > 
> >                 CJK             Chinese         GNU
> >         fffd    a2ce            a2ce            -
> 
> BIG-5 has the notion of a replacement character????

Mentioned above.

> 
> >     3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
> > 
> >                 CJK             Japanese        GNU
> >         00a5    -               5c              5c
> 
> That seems wrong, too.
> 
> >         203e    -               7e              7e
> 
> Likewise.
> 
> > Okay, the comparison says that we need some discussions on the mappings.
> > I'll fix CJKCodecs' bugs as soon as possible and welcome any opinions
> > about mapping inconsistencies. :)
> 
> I'm too lazy now to review the other encodings. I'd encourage you to
> consult ICU for established procedures, and to document the cases
> where you pick one of the possible alternatives. I do hope that this
> set of codecs becomes part of standard Python one day, at which point
> we really need to document what exactly they do.

Thank you for the comments. Your suggestions were very helpful in
making CJKCodecs saner.

> 
> Regards,
> Martin
> 
> 


Regards,
    Hye-Shik =)