[I18n-sig] PEP 263 and Japanese native encodings

M.-A. Lemburg mal@lemburg.com
Wed, 06 Mar 2002 13:49:58 +0100


Tamito KAJIYAMA wrote:
> 
> I read PEP 263: Defining Python Source Code Encodings
> (revision 1.9).  Here are some comments after a discussion on the
> PEP in a Japanese Python mailing list.
> 
> First of all, as a Japanese Python programmer, I would like to
> use three Japanese native encodings EUC-JP, Shift_JIS and
> ISO-2022-JP as a file encoding of Python source files.  I think
> these encodings are considered "ASCII compatible" in the sense
> you mention in the following paragraph in the "Concepts" section:
> 
>   Only ASCII compatible encodings are allowed as source code
>   encoding to assure that Python language elements other than
>   literals and comments remain readable by ASCII processing tools
>   and to avoid problems with wide characters encodings such as
>   UTF-16.
> 
> However, a participant of the discussion in the Japanese Python
> mailing list says, among the three Japanese encodings, Shift_JIS
> and ISO-2022-JP are *not* ASCII compatible.  He defines ASCII
> compatibility as follows:
> 
>   An ASCII compatible encoding (character set) is a superset of
>   the ASCII encoding (character set) in which octets from 0x00
>   to 0x7f are only used to represent ASCII characters and not
>   used in a series of bytes that represent a multibyte character
>   (such as Kanji and Hiragana).
> 
> This definition is too restrictive IMHO, but anyway the term
> "ASCII compatible" is somewhat obscure and needs clarification
> since there are at least two interpretations. 
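
For reference, the stricter definition quoted above can be checked
mechanically. This is only a sketch: reuses_ascii_bytes is a
hypothetical helper, and it assumes a current Python 3 with the
standard Japanese codecs (euc_jp, shift_jis, iso2022_jp) available.

```python
# Hypothetical helper, not part of the PEP: tests one character against
# the stricter "ASCII compatible" definition quoted above.
def reuses_ascii_bytes(char, encoding):
    """True if a non-ASCII character's encoding contains bytes below 0x80."""
    if ord(char) < 0x80:
        raise ValueError("expected a non-ASCII character")
    return any(byte < 0x80 for byte in char.encode(encoding))

# KATAKANA LETTER SO, a character whose Shift_JIS trail byte is 0x5C.
KATAKANA_SO = "\u30bd"

for name in ("euc_jp", "shift_jis", "iso2022_jp"):
    print(name, KATAKANA_SO.encode(name), reuses_ascii_bytes(KATAKANA_SO, name))
```

Under that definition EUC-JP qualifies for this character, while
Shift_JIS and ISO-2022-JP do not, because they place bytes from the
0x00-0x7f range inside a multibyte sequence.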

As far as the Python tokenizer/compiler is concerned, it
only needs to be able to read the first two lines of the file
and decode the encoding information found there as described in
the PEP.
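
A rough sketch of that first step, not the actual implementation:
the regular expression is my assumption of what matches the
"coding: name" form of the magic comment, and find_source_encoding
and its "ascii" default are hypothetical names chosen for
illustration.

```python
import re

# Assumed pattern for the magic comment; the PEP defines the exact form.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def find_source_encoding(source_bytes, default="ascii"):
    """Return the encoding named in the first two lines, or the default."""
    for line in source_bytes.splitlines()[:2]:
        # Only comment lines may carry the declaration.
        if not line.lstrip().startswith(b"#"):
            continue
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return default

header = b"#!/usr/bin/python -uOO\n# -*- coding: iso-2022-jp -*-\n"
print(find_source_encoding(header))
```

Note that the search works on raw bytes, which is why the
declaration itself has to be readable as ASCII before the rest of
the file can be decoded.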

That said, "ASCII compatible encoding" in the PEP description
means that the standard printable ASCII characters, including
the line end characters, are represented by their ASCII
ordinals.
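
One way to test an encoding against this reading is sketched below.
It assumes a current Python 3 whose codec registry knows the
encoding names; is_ascii_compatible is a hypothetical helper, and
string.printable stands in for "standard printable characters plus
line ends".

```python
import string

# Hypothetical check, not part of the PEP: an encoding passes if
# printable ASCII (including \n and \r) encodes to the same bytes
# as it does under plain ASCII.
def is_ascii_compatible(encoding):
    sample = string.printable
    try:
        return sample.encode(encoding) == sample.encode("ascii")
    except (LookupError, UnicodeError):
        # Unknown codec, or one that cannot encode the sample at all.
        return False

for name in ("euc_jp", "shift_jis", "iso2022_jp", "utf-16"):
    print(name, is_ascii_compatible(name))
```

With the codecs shipped in CPython I would expect the three
Japanese encodings to pass and UTF-16 to fail, but the result
depends on the codec implementation in use.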

I only wanted to avoid having to support encodings that use two
or more bytes for every character, such as UTF-16, since these
make the magic comment recognition much more difficult.
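
To illustrate the difficulty (a sketch only; contains_ascii_marker
is a hypothetical helper and the byte-level search is a
simplification of whatever the tokenizer would really do):

```python
LINE = "# -*- coding: iso-2022-jp -*-\n"

def contains_ascii_marker(raw, marker=b"coding:"):
    """True if the ASCII byte sequence for the marker occurs in raw."""
    return marker in raw

as_euc_jp = LINE.encode("euc_jp")      # ASCII text stays one byte per character
as_utf16 = LINE.encode("utf-16-le")    # every character becomes two bytes

print(contains_ascii_marker(as_euc_jp))
print(contains_ascii_marker(as_utf16))
```

In the UTF-16 form each ASCII character is followed by a zero
byte, so a plain byte search for "coding:" no longer finds the
comment unless the reader already knows the file's encoding.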

> For the sake of
> the PEP's readers, it's also useful to provide a (partial) list
> of encodings that can be used as a file encoding.
> 
> In summary, the questions to be raised are:
> 
> o What does the term "ASCII compatible" mean?
> o Are three Japanese native encodings EUC-JP, Shift_JIS and
>   ISO-2022-JP "ASCII compatible"?

Yes, provided they have no problem representing the first two
lines of a source file, e.g.:

#!/usr/bin/python -uOO
# -*- coding: iso-2022-jp -*-
 
> Anyway, thank you for the great proposal.  It will enhance the
> utility of the language for non-Latin Python programmers once
> implemented in the language core.  I really hope that.

Thanks.

Since I will be busy for the next two months, Martin has volunteered
to go ahead with the implementation. I hope that we can have
phase 1 implemented in Python 2.3.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/