[I18n-sig] Re: CJKCodecs 0.9 is released
Hye-Shik Chang
perky@i18n.org
Fri, 20 Jun 2003 05:40:31 +0900
On Wed, Jun 11, 2003 at 10:34:15PM +0200, Martin v. Löwis wrote:
> Hye-Shik Chang <perky@fallin.lv> writes:
[snip]
> > 2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
> >
> > CJK Chinese GNU
> > a15a - fffd -
> > a1c3 - fffd -
> > a1c5 - fffd -
> > a1fe - fffd -
> > a240 - fffd -
> > a2cc - fffd -
> > a2ce - fffd -
>
> What does that mean? CJK and iconv gives UnicodeError, whereas
> ChineseCodecs puts in the replacement character? Seems like a bug in
> ChineseCodecs to me, doesn't it? The replacement character should only
> be generated if errors='replace', no?
According to Unicode.org's mapping:
] A number of characters are not currently mapped because
] of conflicts with other mappings. They are as follows:
]
] BIG5 Description Comments
]
] 0xA15A SPACING UNDERSCORE duplicates A1C4
] 0xA1C3 SPACING HEAVY OVERSCORE not in Unicode
] 0xA1C5 SPACING HEAVY UNDERSCORE not in Unicode
] 0xA1FE LT DIAG UP RIGHT TO LOW LEFT duplicates A2AC
] 0xA240 LT DIAG UP LEFT TO LOW RIGHT duplicates A2AD
] 0xA2CC HANGZHOU NUMERAL TEN conflicts with A451 mapping
] 0xA2CE HANGZHOU NUMERAL THIRTY conflicts with A4CA mapping
]
] We currently map all of these characters to U+FFFD REPLACEMENT CHARACTER.
] It is also possible to map these characters to their duplicates, or to
] the user zone.
So I changed the mappings for them to follow what cp950 does, instead of
using U+FFFD or the user-defined area. I think that's acceptable.
BIG5    Unicode  Description
0xA15A  0x2574   SPACING UNDERSCORE
0xA1C3  0xFFE3   SPACING HEAVY OVERSCORE
0xA1C5  0x02CD   SPACING HEAVY UNDERSCORE
0xA1FE  0xFF0F   LT DIAG UP RIGHT TO LOW LEFT
0xA240  0xFF3C   LT DIAG UP LEFT TO LOW RIGHT
0xA2CC  0x5341   HANGZHOU NUMERAL TEN
0xA2CE  0x5345   HANGZHOU NUMERAL THIRTY
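
For anyone who wants to check what their own build does with these seven
codes, something like the following might help. It is only a sketch: the
helper name codepoint_or_dash is made up for illustration, and it assumes a
Python whose codec registry knows the names "big5" and "cp950". I make no
claim about the output, since the mappings can differ between versions.

```python
def codepoint_or_dash(data, codec):
    """Return the decoded code points as hex, or '-' if strict decoding fails."""
    try:
        text = data.decode(codec)  # errors='strict' is the default
    except UnicodeDecodeError:
        return "-"
    return "+".join(format(ord(ch), "04x") for ch in text)


# The seven Big5 codes that Unicode.org's table leaves unmapped.
CODES = [0xA15A, 0xA1C3, 0xA1C5, 0xA1FE, 0xA240, 0xA2CC, 0xA2CE]

if __name__ == "__main__":
    print("code big5 cp950")
    for code in CODES:
        raw = code.to_bytes(2, "big")
        print(format(code, "04x"),
              codepoint_or_dash(raw, "big5"),
              codepoint_or_dash(raw, "cp950"))
    # U+FFFD should show up only when the caller asks for it explicitly.
    print(repr(b"\xa1".decode("big5", errors="replace")))
```

A '-' in the output means the codec raised UnicodeDecodeError in strict mode,
which is the same convention as in the comparison tables of this thread.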
>
> > 3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
> >
> > CJK Japanese GNU
> > 01c0 005c ff3c ff3c
>
> That appears to be a bug in CJK, right? This is the question whether
> /xa1/xc0 is REVERSE SOLIDUS or FULLWIDTH REVERSE SOLIDUS. Now, it
> appears that euc-jp also supports /x5c, mapped to REVERSE SOLIDUS,
> and that /xa1/xc0 should be interpreted as FULLWIDTH REVERSE SOLIDUS,
> no?
Right. That makes sense.
>
> In case of doubt, I think ICU should be consulted for reference, as
> well, and following some kind of majority. In any case, I think the
> questionable mappings need to be documented.
>
> > f5a1 e000 - e000 -+ User-Defined Area
> > f5a2 e001 - e001 |
> > .... |
> > fefd e3aa - e3aa |
> > fefe e3ab - e3ab -+
>
> What are these? I cannot find them in glibc.
Quoting Ken Lunde's CJKV Information Processing p.206 table 4-66:
] Table 4-66: Shift-JIS to Unicode and EUC-JP for User-Defined Region
]
] Shift-JIS Unicode EUC-JP
] F040-F0FC E000-E0BB F5A1-F5FE, F6A1-F6FE
] F140-F1FC E0BC-E177 F7A1-F7FE, F8A1-F8FE
] F240-F2FC E178-E233 F9A1-F9FE, FAA1-FAFE
] --snip--
] F940-F9FC E69C-E757 8FFDA1-8FFDFE, 8FFEA1-8FFEFE
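
The ranges in that table are consistent with a simple linear layout, which
can be written down as arithmetic. This is a sketch under that assumption
(the snipped rows are taken to follow the same pattern); the function names
are hypothetical and are not CJKCodecs' actual code. Only the two-byte EUC-JP
rows F5..FE are handled, not the three-byte 8Fxxxx forms.

```python
def eucjp_uda_to_unicode(hi, lo):
    """Map a two-byte EUC-JP user-defined code (rows F5..FE) to the PUA."""
    if not (0xF5 <= hi <= 0xFE and 0xA1 <= lo <= 0xFE):
        raise ValueError("not in the two-byte EUC-JP user-defined area")
    # Each row holds 94 cells (A1..FE).
    return 0xE000 + (hi - 0xF5) * 94 + (lo - 0xA1)


def sjis_uda_to_unicode(hi, lo):
    """Map a Shift-JIS user-defined code (lead bytes F0..F9) to the PUA."""
    if not (0xF0 <= hi <= 0xF9 and 0x40 <= lo <= 0xFC and lo != 0x7F):
        raise ValueError("not in the Shift-JIS user-defined area")
    # Each lead byte holds 188 cells (40..7E and 80..FC); 0x7F is skipped.
    offset = lo - 0x40 - (1 if lo > 0x7F else 0)
    return 0xE000 + (hi - 0xF0) * 188 + offset


if __name__ == "__main__":
    print(format(eucjp_uda_to_unicode(0xFE, 0xFE), "04x"))
    print(format(sjis_uda_to_unicode(0xF9, 0xFC), "04x"))
```

The end points agree with both the table above and the f5a1..fefe rows from
the comparison (e000 through e3ab).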
>
> > 4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis
> >
> > CJK Japanese GNU
> > 005c 00a5 005c 00a5
>
> Here, I would trust GNU iconv; 5C really is YEN SIGN.
>
> > 007e 203e 007e 203e
>
> Likewise for OVERLINE - is there really no TILDE in shift-jis?
:)
>
> > 007f - 007f 007f
>
> Why that?
That's a bug in CJK. Fixed.
>
> > 815f 005c ff3c ff3c
>
> Again: Why that? Shouldn't /x81/x5f be FULLWIDTH REVERSE SOLIDUS?
Then shift-jis would lose its *reverse solidus*. And even Unicode.org's
mapping does the same:
] sjis jisx0208 unicode
] 0x815F 0x2140 0x005C # REVERSE SOLIDUS
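
The concern about "losing" a character can be tested as a round trip: encode
a character, decode the bytes, and see whether the same character comes back.
A minimal sketch, assuming codecs registered as "shift_jis" and "euc_jp";
round_trips is a made-up helper name, and the result depends on which
mapping the codec finally adopts.

```python
def round_trips(ch, codec):
    """True if ch survives encode+decode, False if it changes,
    None if the codec cannot encode it at all."""
    try:
        raw = ch.encode(codec)
    except UnicodeEncodeError:
        return None
    return raw.decode(codec) == ch


if __name__ == "__main__":
    candidates = [
        ("REVERSE SOLIDUS", "\\"),
        ("FULLWIDTH REVERSE SOLIDUS", "\uff3c"),
        ("YEN SIGN", "\u00a5"),
    ]
    for name, ch in candidates:
        for codec in ("shift_jis", "euc_jp"):
            print(codec, name, round_trips(ch, codec))
```

A None or False for U+005C would mean the codec has no lossless way to carry
a reverse solidus, which is the situation I wanted to avoid.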
>
> > 817f - 00d7 -
>
> What character is that? Why does JapaneseCodecs map it to
> MULTIPLICATION SIGN? glibc seems to map that to /x81/x7e;
> is that a typo in JapaneseCodecs?
>
> > 837f - 30df -
>
> Are you sure glibc does not support that? Seems to be
> KATAKANA LETTER MI.
>
>
> > 9e7f - 684e -
>
> Why does JapaneseCodecs do that? glibc maps 9e7e to 684e.
They are not in shift-jis's byte range (0x7f is not a valid trail byte).
I guess that JapaneseCodecs' SJIS->EUC macro has a bug around them.
>
> > 5) CJKCodecs' cp932 versus JapaneseCodecs' ms932
> >
> > CJK Japanese GNU
> > 8160 ff5e ff5e 301c
> > 8161 2225 2225 2016
>
> Are these verified against MS CP932, e.g. from Windows XP?
>
> > 817f - 00d7 -
>
> Likewise: For CP932, it seems essential to do whatever Microsoft does,
> in any Windows version.
Okay. Here it is! :)
      CJK   Japanese  GNU   WindowsXP
0080  -     -         -     0080
00a0  -     -         -     f8f0
00fd  -     -         -     f8f1
00fe  -     -         -     f8f2
00ff  -     -         -     f8f3
8160  ff5e  ff5e      301c  ff5e
8161  2225  2225      2016  2225
817c  ff0d  ff0d      2212  ff0d
817f  -     00d7      -     -
8191  ffe0  ffe0      00a2  ffe0
8192  ffe1  ffe1      00a3  ffe1
81ca  ffe2  ffe2      00ac  ffe2
837f  -     30df      -     -
847f  -     043d      -     -
....
e97f  -     9a43      -     -
ea7f  -     9eef      -     -
I'll add 0x80, 0xa0, 0xfd, 0xfe, 0xff to CJKCodecs' cp932 to conform to
Windows's real mapping.
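
A check like this could be automated by keeping the reference column as data
and listing every code where a codec disagrees. The sketch below copies some
of the WindowsXP values from the table above; the names WINDOWS_XP,
decode_hex and mismatches are hypothetical, and it assumes the codec is
registered as "cp932".

```python
# Expected values copied from the WindowsXP column above ('-' = undefined).
WINDOWS_XP = {
    b"\x80": "0080", b"\xa0": "f8f0", b"\xfd": "f8f1",
    b"\xfe": "f8f2", b"\xff": "f8f3",
    b"\x81\x60": "ff5e", b"\x81\x61": "2225", b"\x81\x7c": "ff0d",
    b"\x81\x7f": "-", b"\x81\x91": "ffe0", b"\x81\x92": "ffe1",
    b"\x81\xca": "ffe2", b"\x83\x7f": "-",
}


def decode_hex(raw, codec):
    """Decode strictly; return hex code points, or '-' on failure."""
    try:
        text = raw.decode(codec)
    except UnicodeDecodeError:
        return "-"
    return "+".join(format(ord(ch), "04x") for ch in text)


def mismatches(codec, expected):
    """Return (bytes as hex, expected, actual) for every disagreement."""
    result = []
    for raw, want in expected.items():
        got = decode_hex(raw, codec)
        if got != want:
            result.append((raw.hex(), want, got))
    return result


if __name__ == "__main__":
    for row in mismatches("cp932", WINDOWS_XP):
        print(*row)
```

An empty listing would mean the codec agrees with the reference column for
every code in the dictionary.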
>
> > 2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
> >
> > CJK Chinese GNU
> > fffd a2ce a2ce -
>
> BIG-5 has the notion of a replacement character????
As mentioned above.
>
> > 3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
> >
> > CJK Japanese GNU
> > 00a5 - 5c 5c
>
> That seems wrong, too.
>
> > 203e - 7e 7e
>
> Likewise.
>
> > Okay, the comparison says that we need some discussions on the mappings.
> > I'll fix CJKCodecs' bugs as soon as possible and welcome any opinions
> > about mapping inconsistencies. :)
>
> I'm too lazy now to review the other encodings. I'd encourage you to
> consult ICU for established procedures, and to document the cases
> where you pick one of the possible alternatives. I do hope that this
> set of codecs becomes part of standard Python one day, at which point
> we really need to document what exactly they do.
Thank you for the comments. Your suggestions were very helpful in
making CJKCodecs saner.
>
> Regards,
> Martin
>
>
Regards,
Hye-Shik =)