[I18n-sig] PEP 263 and Japanese native encodings
Martin v. Loewis
martin@v.loewis.de
07 Mar 2002 08:54:17 +0100
Tamito KAJIYAMA <kajiyama@grad.sccs.chukyo-u.ac.jp> writes:
> Shift_JIS is not ASCII compatible in a similar way. It uses
> backslash as a second byte. Here is another example:
>
> >>> u"\u8868".encode("japanese.sjis")
> '\225\\'
I see. I missed the part that the second byte can be in the range
0x40-0xFC. If I understand the problem correctly, the quotation
characters (", ') can *not* appear as the second byte, right?
Also, there are a total of 60 characters whose encoding ends in byte
\x5C; and those will only cause a problem if immediately followed by
a quoting character.
Do you think those 60 characters would cause a problem in real life?
Or is that a problem that only exists on paper?
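
To judge how common the affected characters are, the set can be listed directly. The following is a minimal sketch in modern Python 3, assuming the standard library's "shift_jis" codec rather than the "japanese.sjis" codec quoted above; the function name is hypothetical, and the count it yields depends on the codec's mapping tables, so it need not match the figure of 60.

```python
def chars_ending_in_backslash(encoding="shift_jis"):
    """Return BMP characters whose two-byte encoding ends in 0x5C."""
    hits = []
    for cp in range(0x80, 0x10000):
        try:
            data = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            # Not representable in this encoding (includes surrogates).
            continue
        # Only two-byte sequences matter; a lone 0x5C is an ordinary backslash.
        if len(data) == 2 and data[1] == 0x5C:
            hits.append(chr(cp))
    return hits


if __name__ == "__main__":
    affected = chars_ending_in_backslash()
    print(len(affected), "characters end in 0x5C")
    print("".join(affected))
```

U+8868 from the quoted example is among the results, which gives a quick sanity check on the approach.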
> This is a well-known and highly annoying problem of Python in
> Japanese Windows environment in which Shift_JIS is the system's
> default encoding. There is a patch for Python specifically
> fixing this problem.
A patch specifically designed for Shift_JIS probably is not acceptable
to Python. A patch solving the general problem (in some way) may be.
> So, a definition of ASCII compatible encodings is very important
> since it may or may not accept Shift_JIS and ISO-2022-JP. I
> believe other Asian native encodings are in a similar situation
> with the two Japanese encodings.
All the EUC encodings (EUC-KR, EUC-ZH) should be ASCII
compatible. BIG5 has the same problem as Shift_JIS. Dunno about
GB2312.
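
One way to check a given encoding is to test whether any multi-byte character contains a byte in the ASCII range. This is only a sketch under assumptions: it uses the codec names of Python 3's standard library ("euc_kr", "gb2312", "big5", "shift_jis"), the function name is hypothetical, and it only looks at the Basic Multilingual Plane. It also tests just one aspect of ASCII compatibility, so stateful encodings such as ISO-2022-JP need separate consideration.

```python
def has_ascii_bytes_in_multibyte(encoding):
    """True if some multi-byte character contains a byte below 0x80."""
    for cp in range(0x80, 0x10000):
        try:
            data = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            # Character not representable in this encoding.
            continue
        if len(data) > 1 and any(b < 0x80 for b in data):
            return True
    return False


if __name__ == "__main__":
    for name in ("euc_kr", "gb2312", "big5", "shift_jis"):
        print(name, has_ascii_bytes_in_multibyte(name))
```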
> I don't want the PEP to exclude the two widely used Japanese
> encodings, especially Shift_JIS.
Then you need to propose an implementation strategy, and that strategy
should *not* be "special-case Shift_JIS", and it also should not be
"use the C library's multibyte functions".
In phase 2 of the PEP, both Shift_JIS and ISO-2022-JP will be
acceptable source encodings - but we are in search of an
implementation strategy for that as well. So anybody working on this
would be encouraged to implement Phase 2 of the PEP. Until then, I
suggest living with the limitation that these 60 characters cannot
appear as the last character in a string.
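
The limitation can be illustrated with a toy scanner. This is a hypothetical sketch in Python 3, not the actual tokenizer: `literal_is_closed` is an invented helper that treats 0x5C as an escape and 0x22 as the quote, applied first to the raw Shift_JIS bytes and then to the decoded code points.

```python
def literal_is_closed(units):
    """Scan ints (bytes or code points) for a closed double-quoted literal."""
    QUOTE, BACKSLASH = 0x22, 0x5C
    it = iter(units)
    # Skip ahead to the opening quote.
    for u in it:
        if u == QUOTE:
            break
    else:
        return False
    for u in it:
        if u == BACKSLASH:
            next(it, None)  # the escape consumes the following unit
        elif u == QUOTE:
            return True
    return False


if __name__ == "__main__":
    raw = 's = "\u8868"'.encode("shift_jis")  # the literal ends in 0x95 0x5C 0x22
    print(literal_is_closed(raw))                             # byte-wise scan
    print(literal_is_closed(map(ord, raw.decode("shift_jis"))))  # after decoding
```

Scanning bytes, the trailing 0x5C swallows the closing quote, so the literal never terminates; scanning the decoded text does not have this problem.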
Regards,
Martin