[Python-Dev] Adding Japanese Codecs to the distro

16 Jan 2003 13:02:04 +0100

"M.-A. Lemburg" <mal@lemburg.com> writes:

> I was suggesting to make Suzuki's codecs the default. That
> doesn't prevent Tamito's codecs from working, since these
> are inside a package.

I wonder who will be helped by adding these codecs, if anybody who
needs to process Japanese data on a regular basis will have to install
that other package, anyway.

> If someone wants the C codecs, we should provide them as
> separate download right alongside of the standard distro (as
> discussed several times before).

I still fail to see the rationale for that (or, rather, the rationale
seems to vanish more and more). AFAIR, "size" was brought up as an
argument against the code. However, the code base already contains
huge amounts of code that not everybody needs, and the size increase
on a binary distribution is rather minimal.

> Note that the C codecs are not as easy to modify to special
> needs as the Python ones. While this may seem unnecessary
> I've heard from a few people that especially companies tend
> to extend the mappings with their own set of company specific
> code points.

The Python codecs are not easy to modify, either: there is a large
generated table, and you actually have to understand the generation
algorithm, augment it, and run it through Jython. After that, you get a
new mapping table, which you need to carry around *instead* of the one
shipped with Python. So any user who wants to extend the mapping needs
the generator more than the generated output.

If you want to augment the codec as-is, i.e. by wrapping it, your best
option is to install a PEP 293 error handler. This works nicely both with C codecs
and pure Python codecs (out of the box, it probably works with neither
of the candidate packages, but that would have to be fixed).
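
A minimal sketch of the error-handler approach, written against the
codecs.register_error interface that PEP 293 introduced. The handler
name, the extension table and its byte values are all hypothetical, and
the stock shift_jis codec of a current CPython stands in for either
candidate package:

```python
import codecs

# Hypothetical company-specific additions: private-use code points
# mapped to byte sequences that the base codec does not define.
COMPANY_EXTENSIONS = {
    "\ue000": b"\xf0\x40",
    "\ue001": b"\xf0\x41",
}


def company_extensions(exc):
    """Error handler that supplies mappings the base codec lacks."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        if ch not in COMPANY_EXTENSIONS:
            raise exc  # still unencodable: keep the original error
        out += COMPANY_EXTENSIONS[ch]
    # Return the replacement bytes and the position to resume from.
    return bytes(out), exc.end


codecs.register_error("company-extensions", company_extensions)

encoded = "A\ue000B".encode("shift_jis", errors="company-extensions")
```

The base codec and its table stay untouched; only the code points it
rejects ever reach the handler.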

Or, if you don't go the PEP 293 route, you can still use a plain wrapper
around both codecs.
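
A plain wrapper could look roughly like this. It is only a sketch: the
function name and the extension table are hypothetical, and encoding
one character at a time assumes a stateless base codec such as
shift_jis. Since the base codec is looked up by name, it makes no
difference whether it is implemented in C or in Python:

```python
# Hypothetical extension table, consulted before the base codec.
EXTRA_ENCODE = {"\ue000": b"\xf0\x40"}


def encode_extended(text, base="shift_jis"):
    """Encode text, preferring the extension table over the base codec."""
    out = bytearray()
    for ch in text:
        extra = EXTRA_ENCODE.get(ch)
        # Fall back to the unmodified base codec for everything else.
        out += extra if extra is not None else ch.encode(base)
    return bytes(out)


extended = encode_extended("A\ue000B")
```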

> We already have this on Windows (via the mbcs codec). 

That is insufficient, though, since it gives access to a single
platform codec only. I have some code sitting around that exposes the
codecs from inet.dll (or some such); this is the codec library that
IE6 uses.

> If you could contribute your iconv codecs under the PSF license we'd
> go a long way in that direction on Unix as well.

Ok, will do. There are still some issues with the code itself that
need to be fixed, then I'll contribute it.

> > *If* Suzuki's code is incorporated, I'd like to get independent
> > confirmation that it is actually correct.
> 
> Since he built the codecs on the mappings in Java, this
> looks like enough third party confirmation already.

Not really. I *think* Sun has, when confronted with a
popularity-or-correctness issue, taken the popularity side, leaving
correctness alone. Furthermore, the code doesn't use the Java tables
throughout, but short-cuts them. E.g. in shift_jis.py, we find

        if i < 0x80:                    # C0, ASCII
            buf.append(chr(i))

where i is a Unicode codepoint. I believe this is incorrect: In
shift-jis, 0x5c is YEN SIGN, and indeed, the codec goes on with

        elif i == 0xA5:                 # Yen
            buf.append('\\')

So it maps both REVERSE SOLIDUS and YEN SIGN to 0x5c; this is an
error (if it were a CP932 codec, it might (*) have been correct). See

http://rf.net/~james/Japanese_Encodings.txt
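
To make the collision concrete, here is a toy encoder that mimics only
the two branches quoted above. It is not the actual shift_jis.py; the
function name is made up and every other code point is simply rejected:

```python
def toy_shift_jis_encode(text):
    """Reproduce just the two quoted branches; nothing else is handled."""
    buf = []
    for ch in text:
        i = ord(ch)
        if i < 0x80:                    # C0, ASCII (includes U+005C)
            buf.append(bytes([i]))
        elif i == 0xA5:                 # Yen
            buf.append(b"\\")
        else:
            raise ValueError("not covered by this sketch: %r" % ch)
    return b"".join(buf)


backslash = toy_shift_jis_encode("\\")      # REVERSE SOLIDUS, U+005C
yen = toy_shift_jis_encode("\u00a5")        # YEN SIGN, U+00A5

# Both inputs produce the single byte 0x5c, so a decoder cannot tell
# which of the two characters was originally meant.
collision = backslash == yen
```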

Regards,
Martin

(*) I'm not sure here; it might also be that Microsoft maps YEN SIGN
to the full-width yen sign in CP932.