japanese encoding iso-2022-jp in python vs. perl

Wed Oct 24 06:31:01 EDT 2007

Thanks Leo, and everyone else, these were very helpful replies.  The
issue was exactly as Leo described, and I apologize for not being
aware of it, and thus not quite reporting it correctly.

At the moment I don't care about round-tripping between half-width and
full-width kana; I only need to be able to rely on any particular
kana character being translated correctly to its half-width or
full-width equivalent, and I need the Japanese I send out to be readable.

I appreciate the 'implicit versus explicit' point, and have read about
it in a few different python mailing lists.  In this instance it seems
that perl perhaps ought to flash a warning notification regarding what
it is doing, but as this conversion between half-width and full-width
characters is by far the most logical one available, it also seems
reasonable that python might perhaps include such capabilities by
default, just as it currently includes the 'replace' option for
mapping missed characters generically to '?'.
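
For reference, the difference between the default strict handling and
the 'replace' option can be seen in a few lines. This is only a sketch,
written for Python 3 (an assumption; the code elsewhere in this thread
is Python 2), using the standard iso2022_jp codec and the U+FF92
character from Leo's reply:

```python
# Python 3 sketch: how the stock iso2022_jp codec treats U+FF92.
text = "\uff92"  # HALFWIDTH KATAKANA LETTER ME

try:
    text.encode("iso2022_jp")      # default is errors="strict"
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True           # the codec has no mapping for U+FF92

# With errors="replace" the unmappable character becomes a question mark.
replaced = text.encode("iso2022_jp", errors="replace")

print(strict_failed)
print(replaced)
```

Neither mode changes the width of the character; 'replace' simply
discards it in favour of '?'.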

I still haven't worked out the entire mapping routine, but Leo's hint
is probably sufficient to get it working with a bit more effort.
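
One possible shape for such a mapping routine is sketched below. It is
not a finished solution: it is written for Python 3 (an assumption),
the names widen_halfwidth_kana and encode_iso2022_jp are made up for
illustration, and it only touches U+FF61..U+FF9F (halfwidth katakana
and the related punctuation). It normalizes whole runs rather than
single characters, so that a halfwidth kana followed by a halfwidth
voiced sound mark is composed into one full-width character.

```python
import re
from unicodedata import normalize

# Runs of halfwidth katakana and related punctuation (U+FF61..U+FF9F).
_HALFWIDTH_KANA_RUN = re.compile("[\uff61-\uff9f]+")


def widen_halfwidth_kana(text):
    """Replace runs of halfwidth kana with their NFKC (full-width) form.

    Characters outside U+FF61..U+FF9F are left untouched.
    """
    return _HALFWIDTH_KANA_RUN.sub(
        lambda match: normalize("NFKC", match.group(0)), text
    )


def encode_iso2022_jp(text):
    """Widen halfwidth kana, then encode strictly as ISO-2022-JP."""
    return widen_halfwidth_kana(text).encode("iso2022_jp")


if __name__ == "__main__":
    # U+FF92 (halfwidth ME) becomes U+30E1 (full-width ME).
    print(ascii(widen_halfwidth_kana("\uff92")))
    # Halfwidth KA + halfwidth voiced mark compose to U+30AC (GA).
    print(ascii(widen_halfwidth_kana("\uff76\uff9e")))
```

Because the encode step stays strict, anything the widening does not
cover still raises UnicodeEncodeError instead of silently turning into
'?'.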

Again, thanks for the help.

-Joe

> Thanks that I have my crystal ball working. I can see clearly that the
> fourth character of the input is 'HALFWIDTH KATAKANA LETTER ME'
> (U+FF92), which is not present in ISO-2022-JP as defined by RFC 1468,
> so python converts it into a question mark as you requested. Meanwhile
> perl as usual is trying to guess what you want and silently converts
> that character into 'KATAKANA LETTER ME' (U+30E1), which is present in
> ISO-2022-JP.
>
> > Why can't python properly encode some of these
> > characters?
>
> Because "Explicit is better than implicit". Do you care about
> roundtripping? Do you care about the width of characters? What about
> full-width " (U+FF02)? Python doesn't know the answers to these
> questions, so it doesn't do anything with your input. You have to do it
> yourself. Assuming you don't care about roundtripping and width, here
> is an example demonstrating how to deal with narrow characters:
>
> from unicodedata import normalize
> iso2022_squeezing = dict((i, normalize('NFKC', unichr(i)))
>                          for i in range(0xFF61, 0xFFE0))
> print repr(u'\uFF92'.translate(iso2022_squeezing))
>
> It prints u'\u30e1'. Feel free to ask questions if something is not clear.
>
> Note, this is just an example, I *don't* claim it does what you want
> for any character in FF61-FFDF range. You may want to carefully review
> the whole unicode block: http://www.unicode.org/charts/PDF/UFF00.pdf
>
>   -- Leo.
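
For anyone reading this later under Python 3, the quoted snippet could
be written roughly as follows. This is a sketch, not Leo's code: chr()
stands in for unichr(), and the encodable helper and still_failing list
are additions made up here to illustrate his caveat that the table does
not do the right thing for every character in FF61-FFDF.

```python
from unicodedata import normalize

# Same table as in the quoted snippet, with chr() in place of unichr().
iso2022_squeezing = {
    i: normalize("NFKC", chr(i)) for i in range(0xFF61, 0xFFE0)
}

# ascii() keeps the escaped form; this prints '\u30e1'.
print(ascii("\uff92".translate(iso2022_squeezing)))


def encodable(s):
    """Return True if Python's iso2022_jp codec can encode s strictly."""
    try:
        s.encode("iso2022_jp")
        return True
    except UnicodeEncodeError:
        return False


# Code points whose per-character NFKC result still cannot be encoded,
# e.g. U+FF9E, which maps to the combining mark U+3099.
still_failing = [i for i, s in iso2022_squeezing.items() if not encodable(s)]
print(["U+%04X" % i for i in still_failing])
```

The second list is the part of the block that needs the careful review
mentioned above before the table is used on real input.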