japanese encoding iso-2022-jp in python vs. perl

Wed Oct 24 03:20:35 EDT 2007

On Oct 23, 3:37 am, kettle <Josef.Robert.No... at gmail.com> wrote:
> Hi,
>   I am rather new to python, and am currently struggling with some
> encoding issues.  I have some utf-8-encoded text which I need to
> encode as iso-2022-jp before sending it out to the world. I am using
> python's encode functions:
> --
>  var = var.encode("iso-2022-jp", "replace")
>  print var
> --
>
>  I am using the 'replace' argument because there seem to be a couple
> of utf-8 japanese characters which python can't correctly convert to
> iso-2022-jp.  The output looks like this:
> ↓東京???日比谷線?北千住行
>
>  However if use perl's encode module to re-encode the exact same bit
> of text:
> --
>  $var = encode("iso-2022-jp", decode("utf8", $var))
>  print $var
> --
>
>  I get proper output (no unsightly question-marks):
> ↓東京メトロ日比谷線・北千住行
>
> So, what's the deal?  

Luckily my crystal ball is working. I can clearly see that the fourth
character of the input is 'HALFWIDTH KATAKANA LETTER ME' (U+FF92), which
is not present in ISO-2022-JP as defined by RFC 1468, so Python converts
it into a question mark, as you requested. Meanwhile Perl, as usual,
tries to guess what you want and silently converts that character into
'KATAKANA LETTER ME' (U+30E1), which is present in ISO-2022-JP.
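
To make the difference concrete, here is a minimal sketch in Python 3
syntax (the code in this thread is Python 2). The helper name can_encode
is made up for illustration; the only library call is str.encode with
the standard "iso-2022-jp" codec.

```python
# Python 3 sketch; can_encode is a hypothetical helper, not a library function.
halfwidth_me = "\uFF92"  # HALFWIDTH KATAKANA LETTER ME
fullwidth_me = "\u30E1"  # KATAKANA LETTER ME

def can_encode(text, encoding="iso-2022-jp"):
    """Return True if text encodes under the given codec in strict mode."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# With errors="replace", the unencodable character becomes a question mark.
replaced = halfwidth_me.encode("iso-2022-jp", "replace")
# The regular katakana letter survives an encode/decode round trip.
roundtrip = fullwidth_me.encode("iso-2022-jp").decode("iso-2022-jp")

print(can_encode(halfwidth_me), can_encode(fullwidth_me), replaced)
```

Dropping the "replace" argument while debugging makes Python raise
UnicodeEncodeError at the first offending character instead of hiding it.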

> Why can't python properly encode some of these
> characters?

Because "Explicit is better than implicit". Do you care about
round-tripping? Do you care about the width of characters? What about
full-width " (U+FF02)? Python doesn't know the answers to these
questions, so it doesn't do anything with your input. You have to do it
yourself. Assuming you don't care about round-tripping and width, here
is an example demonstrating how to deal with narrow characters:

from unicodedata import normalize
# Python 2: map each code point in U+FF61..U+FFDF to its NFKC form.
iso2022_squeezing = dict((i, normalize('NFKC', unichr(i)))
                         for i in range(0xFF61, 0xFFE0))
print repr(u'\uFF92'.translate(iso2022_squeezing))

It prints u'\u30e1'. Feel free to ask questions if something is not clear.

Note that this is just an example; I *don't* claim it does what you want
for every character in the FF61-FFDF range. You may want to carefully
review the whole Unicode block:
http://www.unicode.org/charts/PDF/UFF00.pdf
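
For readers on Python 3, the same idea might look like the sketch below.
The names squeeze and to_iso2022jp are made up for illustration, and the
sample string is an assumption: it guesses that the original input used
halfwidth ME, TO, RO and a halfwidth middle dot where the question marks
appeared.

```python
from unicodedata import normalize

# Map each code point in U+FF61..U+FFDF to its NFKC form (chr replaces unichr).
squeeze = {i: normalize("NFKC", chr(i)) for i in range(0xFF61, 0xFFE0)}

def to_iso2022jp(text):
    """Widen halfwidth forms, then encode strictly (hypothetical helper).

    Characters that are still unencodable raise UnicodeEncodeError.
    """
    return text.translate(squeeze).encode("iso-2022-jp")

# Assumed input: halfwidth katakana and middle dot written as escapes.
sample = "↓東京\uFF92\uFF84\uFF9B日比谷線\uFF65北千住行"
encoded = to_iso2022jp(sample)

# Decode again only to show the text that was actually sent.
print(encoded.decode("iso-2022-jp"))
```

One caveat of the per-character table: the halfwidth voiced sound mark
(U+FF9E) maps to the combining mark U+3099, which is not composed with
the preceding kana here, so input containing it may need an extra
normalize('NFC', ...) pass before encoding.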

  -- Leo.