[I18n-sig] PEP 263 and Japanese native encodings

07 Mar 2002 08:54:17 +0100

Tamito KAJIYAMA <kajiyama@grad.sccs.chukyo-u.ac.jp> writes:

> Shift_JIS is not ASCII compatible in a similar way.  It uses
> backslash as a second byte.  Here is another example:
> 
>   >>> u"\u8868".encode("japanese.sjis")
>   '\225\\'

I see. I missed the part that the second byte can be in the range
0x40-0xFC. If I understand the problem correctly, the quotation
characters (", ') can *not* appear as the second byte, right?
Also, there is a total of 60 characters that end in byte \x5C;
and those will only cause a problem if immediately followed by
a quoting character.
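
The byte-level claims above can be checked with a small sketch. It
assumes Python 3 and its built-in "shift_jis" codec (the
"japanese.sjis" codec in the quoted example is a different, older
package), and the helper names are made up for illustration. The
count it reports depends on the codec's mapping table, so it need not
come out at exactly 60.

```python
# Sketch: which Shift_JIS characters have a given second (trail) byte?
# Assumes Python 3's built-in "shift_jis" codec; helper names are hypothetical.
CODEC = "shift_jis"

# Lead bytes of two-byte Shift_JIS sequences (0x81-0x9F and 0xE0-0xFC).
LEAD_BYTES = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))


def decodes_as_one_char(raw, codec=CODEC):
    """Return the character if raw is exactly one encoded character, else None."""
    try:
        text = raw.decode(codec)
        if len(text) == 1 and text.encode(codec) == raw:
            return text
    except UnicodeError:
        pass
    return None


def chars_with_trail_byte(trail, codec=CODEC):
    """Collect every character whose second byte equals trail."""
    found = []
    for lead in LEAD_BYTES:
        ch = decodes_as_one_char(bytes([lead, trail]), codec)
        if ch is not None:
            found.append(ch)
    return found


backslash_chars = chars_with_trail_byte(0x5C)
quote_chars = chars_with_trail_byte(0x22) + chars_with_trail_byte(0x27)

print(len(backslash_chars), "characters end in byte 0x5C")
print("quote bytes usable as second byte:", bool(quote_chars))
```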

Do you think those 60 characters would cause a problem in real life?
Or is that a problem that only exists on paper?
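
To make the failure concrete, here is a minimal sketch of a
byte-oriented scanner that treats byte 0x5C as an escape character.
The function find_string_end is hypothetical and is not Python's
actual tokenizer; it only illustrates why a character ending in 0x5C
breaks a literal when the closing quote follows immediately, and why
it is harmless otherwise.

```python
# Sketch: a naive byte-wise scanner for the end of a quoted string.
# find_string_end is a hypothetical helper, not CPython's tokenizer.

def find_string_end(data, start=0):
    """Return the index of the closing quote, or -1 if none is found.

    data[start] must be the opening quote. A backslash byte (0x5C)
    makes the scanner skip the byte that follows it.
    """
    quote = data[start]
    i = start + 1
    while i < len(data):
        if data[i] == 0x5C:      # looks like an escape: skip next byte
            i += 2
            continue
        if data[i] == quote:
            return i
        i += 1
    return -1


# U+8868 encodes to the two bytes 0x95 0x5C in Shift_JIS.
alone = "'\u8868'".encode("shift_jis")           # character right before the quote
followed = "'\u8868\u793a'".encode("shift_jis")  # another character in between

print(find_string_end(alone))     # closing quote is swallowed
print(find_string_end(followed))  # closing quote is found
```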

> This is a well-known and highly annoying problem of Python in
> Japanese Windows environment in which Shift_JIS is the system's
> default encoding.  There is a patch for Python specifically
> fixing this problem.

A patch specifically designed for Shift_JIS is probably not acceptable
for Python. A patch solving the general problem (in some way) may be.

> So, a definition of ASCII compatible encodings is very important
> since it may or may not accept Shift_JIS and ISO-2022-JP.  I
> believe other Asian native encodings are in a similar situation
> with the two Japanese encodings.

All the EUC encodings (EUC-KR, EUC-CN) should be ASCII
compatible. Big5 has the same problem as Shift_JIS. Dunno about
GB2312.
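
One way to test a given encoding is to look for multibyte sequences
that contain a byte in the ASCII range. The sketch below assumes
Python 3 and uses its codec names ("euc_kr", "big5", "shift_jis",
"gb2312") as stand-ins for the encodings named above; the function
name is made up, and the result reflects Python's codec tables rather
than any formal definition of "ASCII compatible".

```python
# Sketch: does a codec ever place an ASCII-range byte inside a
# multibyte sequence? The function name is hypothetical.

def has_ascii_byte_in_multibyte(codec):
    """Return True if any BMP character encodes to more than one byte
    with at least one of those bytes below 0x80."""
    for cp in range(0x80, 0x10000):
        try:
            raw = chr(cp).encode(codec)
        except UnicodeError:
            continue             # not representable in this codec
        if len(raw) > 1 and any(b < 0x80 for b in raw):
            return True
    return False


for name in ("euc_kr", "gb2312", "big5", "shift_jis"):
    print(name, has_ascii_byte_in_multibyte(name))
```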

> I don't want the PEP to exclude the two widely used Japanese
> encodings, especially Shift_JIS.

Then you need to propose an implementation strategy, and that strategy
should *not* be "special-case Shift_JIS", and it also should not be
"use the C library's multibyte functions".

In Phase 2 of the PEP, both Shift_JIS and ISO-2022-JP will be
acceptable source encodings - but we are still in search of an
implementation strategy for that as well. So anybody working on this
is encouraged to implement Phase 2 of the PEP. Until then, I
suggest living with the limitation that those 60 characters cannot
appear as the last character in a string literal.
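
As an illustration only, one encoding-independent approach is to
decode the source with its declared encoding and re-encode it as
UTF-8 before any byte-wise scanning, since UTF-8 uses only bytes
0x80 and above inside multibyte sequences. This is a sketch under
that assumption, not the strategy decided for the PEP, and both
function names are hypothetical.

```python
# Sketch: transcode to UTF-8 first, then scan bytes.
# Both helpers are hypothetical illustrations, not CPython internals.

def find_string_end(data, start=0):
    """Naive byte-wise search for the closing quote; -1 if not found."""
    quote = data[start]
    i = start + 1
    while i < len(data):
        if data[i] == 0x5C:      # backslash: skip the escaped byte
            i += 2
            continue
        if data[i] == quote:
            return i
        i += 1
    return -1


def to_utf8(raw, encoding):
    """Decode raw source bytes with the declared encoding, re-encode as UTF-8."""
    return raw.decode(encoding).encode("utf-8")


sjis_source = "'\u8868'".encode("shift_jis")   # bytes 27 95 5C 27
utf8_source = to_utf8(sjis_source, "shift_jis")

print(find_string_end(sjis_source))   # fails on the raw Shift_JIS bytes
print(find_string_end(utf8_source))   # succeeds after transcoding
```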

Regards,
Martin