[I18n-sig] JapaneseCodecs 1.4.8 released

Tamito KAJIYAMA kajiyama@grad.sccs.chukyo-u.ac.jp
Fri, 6 Sep 2002 10:38:05 +0900


martin@v.loewis.de (Martin v. Loewis) writes:
| 
| > The only reason for choosing the Microsoft mapping is that
| > it seems better.  The Consortium's mapping has a problem that
| > both 0x5c and 0x815f in Shift_JIS are mapped to U+005c, which
| > is in turn mapped to 0x5c in Shift_JIS.  In other words, the
| > Consortium's mapping is many-to-one.
| 
| I can agree on the mapping of 0x815f; it maps to U+FF3C on glibc. I'm
| confused about 0x5c; glibc maps it to U+00A5 (YEN SIGN).
| 
| Also, where did you get the mapping from the Consortium? I can't find
| a current table, but
| 
| http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT
| 
| maps 0x5C to U+00A5, and 0x815F to U+005C. So this roundtrips just
| fine.

I've finally understood what was wrong: the mapping in
JapaneseCodecs has a number of bugs!  The Unicode Consortium's
mapping is fine, but JapaneseCodecs did not implement it
correctly (although I intended it to).

I got the Consortium's mapping from the URL shown above.
However, I carelessly modified the original mapping as
follows:

  the Unicode Consortium's original mapping:
    0x5c   -> U+00A5 -> 0x5c
    0x7e   -> U+203e -> 0x7e
    0x815f -> U+005c -> 0x815f

  the current (buggy) mapping in JapaneseCodecs:
    0x5c   -> U+005c -> 0x5c
    0x7e   -> U+007e -> 0x7e
    0x815f -> U+005c -> 0x5c

In other words, I had introduced the non-reversibility problem
myself!  I'd like to hit my head against the wall thousands of
times...
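
The non-reversibility can be checked mechanically.  The following
is only a minimal sketch, not the actual JapaneseCodecs code: the
table names and helper functions are hypothetical, only the three
codes discussed here are included, and I assume that when two
Shift_JIS codes collide on one code point the encoder keeps the
lower code (0x5c), as described in my earlier message.

```python
# Decode tables (Shift_JIS code -> Unicode code point) for the three codes
# under discussion. The names are hypothetical, not JapaneseCodecs internals.
CONSORTIUM_DECODE = {0x5C: 0x00A5, 0x7E: 0x203E, 0x815F: 0x005C}
BUGGY_DECODE = {0x5C: 0x005C, 0x7E: 0x007E, 0x815F: 0x005C}


def build_encode_table(decode_table):
    """Invert a decode table; on a collision the lowest Shift_JIS code wins.

    The tie-breaking rule is an assumption made for this sketch.
    """
    encode_table = {}
    # Visit codes from highest to lowest so that lower codes overwrite.
    for sjis in sorted(decode_table, reverse=True):
        encode_table[decode_table[sjis]] = sjis
    return encode_table


def non_roundtrip_codes(decode_table):
    """Return the Shift_JIS codes that do not survive decode -> encode."""
    encode_table = build_encode_table(decode_table)
    return sorted(
        sjis for sjis, ucs in decode_table.items() if encode_table[ucs] != sjis
    )


print([hex(c) for c in non_roundtrip_codes(CONSORTIUM_DECODE)])  # []
print([hex(c) for c in non_roundtrip_codes(BUGGY_DECODE)])       # ['0x815f']
```

With these tables, only 0x815f in the buggy mapping fails to
round-trip, because it shares U+005c with 0x5c.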

It seems that there are two solutions: one is to implement
the Consortium's mapping intact, and the other is to fix the
current buggy mapping so that 0x815f maps to U+ff3c (the latter
would mean adopting Java's mapping, I believe).
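
To compare the two candidates side by side, here is a rough
sketch.  The table names and helpers are made up for illustration,
and the tables cover only the three codes in question, so this
says nothing about the rest of the mapping.

```python
import unicodedata

# Candidate decode tables (Shift_JIS code -> Unicode code point).
# Both names are hypothetical labels for the two solutions.
CONSORTIUM_INTACT = {0x5C: 0x00A5, 0x7E: 0x203E, 0x815F: 0x005C}
FULLWIDTH_FIX = {0x5C: 0x005C, 0x7E: 0x007E, 0x815F: 0xFF3C}


def is_reversible(decode_table):
    """True if no two Shift_JIS codes decode to the same code point."""
    return len(set(decode_table.values())) == len(decode_table)


def describe(decode_table):
    """Return (Shift_JIS code, code point, Unicode character name) rows."""
    return [
        ("0x%x" % sjis, "U+%04X" % ucs, unicodedata.name(chr(ucs)))
        for sjis, ucs in sorted(decode_table.items())
    ]


for label, table in [("intact", CONSORTIUM_INTACT), ("fixed", FULLWIDTH_FIX)]:
    print(label, "reversible:", is_reversible(table))
    for row in describe(table):
        print("  ", *row)
```

Both candidates are reversible for these three codes; they differ
in whether 0x5c and 0x7e keep their ASCII meaning (REVERSE SOLIDUS
and TILDE) or become YEN SIGN and OVERLINE.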

| > Sorry, I'm not sure I've got the picture of what transliteration
| > support would do.  Transliteration support is meant to solve
| > interoperability problems due to differences among vendor-
| > specific mappings, right?
| 
| No. In general, transliteration adds one-way mappings, to allow
| mapping a larger subset of Unicode to the target mapping. For example,
| "ö" is not supported in ASCII, but a common transliteration (for
| German) is to write "oe". So, u"\u00f6".encode("ascii") raises a
| UnicodeError, where u"\u00f6".encode("ascii//translit-german") might
| return "oe" (this is not implemented in Python).
| 
| Therefore, a transliteration mapping never roundtrips - but it is
| still useful as it attempts to map as much of Unicode to the target
| encoding as reasonable. In your specific case, you could use
| transliteration to map both the default form and the full-width form
| from Unicode to the same JIS - but only one of the forms will
| round-trip.
| 
| I agree that round-trip support is valuable, and should be the
| default. I do think there is also a need for a "best effort" mapping.

I see.  Transliteration, in the context of JapaneseCodecs, can
be used to provide fallback mappings, right?  I agree that such
a "best effort" mapping is useful and surely needed in a variety
of applications.
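
As I understand it, such a fallback could be layered on top of
the strict table roughly as follows.  This is only a sketch under
my own assumptions: the function name, the best_effort flag and
the choice of fallback entry are all hypothetical, and none of it
exists in JapaneseCodecs.

```python
# Round-trip entries (Unicode code point -> Shift_JIS code), following the
# Consortium's table for the three codes discussed above.
STRICT_ENCODE = {0x00A5: 0x5C, 0x203E: 0x7E, 0x005C: 0x815F}

# One-way fallback entries. This particular entry is a hypothetical example:
# it sends FULLWIDTH REVERSE SOLIDUS to the same code as REVERSE SOLIDUS.
FALLBACK_ENCODE = {0xFF3C: 0x815F}


def encode_code(char, best_effort=False):
    """Return the Shift_JIS code for one character.

    Strict entries are always tried first; fallback entries are used only
    when best_effort is true, so the default behaviour stays reversible.
    """
    code_point = ord(char)
    if code_point in STRICT_ENCODE:
        return STRICT_ENCODE[code_point]
    if best_effort and code_point in FALLBACK_ENCODE:
        return FALLBACK_ENCODE[code_point]
    raise ValueError("no mapping for U+%04X" % code_point)


print(hex(encode_code("\\")))                        # strict entry
print(hex(encode_code("\uff3c", best_effort=True)))  # fallback entry
```

Both calls yield 0x815f, but only U+005c comes back when 0x815f
is decoded, so the fallback entry never round-trips.  Without
best_effort, U+ff3c is rejected as before.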

Thanks a lot!

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>