[Python-Dev] PEP 263 considered faulty (for some Japanese)

Martin v. Loewis martin@v.loewis.de
12 Mar 2002 09:21:49 +0100


SUZUKI Hisao <suzuki611@oki.com> writes:

>    I have also read the Parade of the PEPs and know that it is
> very close to being checked in, so I am writing this message to
> you in English in a hurry.  The PEP 263, as is, will damage the
> usability of Python in Japan.

Please understand that the behaviour you object to was added
specifically at the request of Japanese users.

>    The PEP says, "Just as in coercion of strings to Unicode,
> Python will default to the interpreter's default encoding (which
> is ASCII in standard Python installations) as standard encoding
> if no other encoding hints are given."  This will let many
> English people free from writing the magic comment to their
> scripts explicitly.  

While that is true, it will in particular free Japanese users from
putting encoding declarations in their files. Japanese users often
set the default encoding to shift-jis or euc-jp. When Python source
code is transferred between Unix and Windows, tools are used to
convert files between these two encodings. If a file carries an
encoding declaration, those tools would need to update it as well,
but the existing tools don't.

Therefore, it was considered desirable to not use an encoding
declaration if the default encoding matches the file encoding. It is
well-understood that files without declared encoding will be less
portable across systems.
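
To illustrate why a declaration complicates such conversions, here is
a minimal Python 3 sketch. The helper name convert_source and the
simplified pattern are assumptions made for illustration; they are
neither an existing tool nor the PEP's exact rule. Re-encoding the
bytes alone leaves the comment naming the old encoding, so a converter
would have to rewrite it as well.

```python
import re

# Simplified pattern for an encoding declaration (an assumption for this
# sketch; the PEP defines the authoritative form).
DECL = re.compile(r"(coding[:=][ \t]*)([-_.a-zA-Z0-9]+)")


def convert_source(data, src, dst):
    """Re-encode source bytes from src to dst, updating a declaration if present.

    Hypothetical helper: only comment lines among the first two are checked.
    """
    lines = data.decode(src).split("\n")
    for i, line in enumerate(lines[:2]):
        if line.lstrip().startswith("#") and DECL.search(line):
            # Keep the "coding:" prefix, swap only the encoding name.
            lines[i] = DECL.sub(lambda m: m.group(1) + dst, line, count=1)
            break
    return "\n".join(lines).encode(dst)


# Sample file contents: a declaration plus a Japanese string literal.
declared = '# -*- coding: euc-jp -*-\ns = "\u65e5\u672c\u8a9e"\n'.encode("euc-jp")

# A tool that only re-encodes leaves a stale declaration behind.
naive = declared.decode("euc-jp").encode("shift_jis")

# The sketch above rewrites the declaration while re-encoding.
fixed = convert_source(declared, "euc-jp", "shift_jis")
```

A file without a declaration passes through such a converter with
nothing to rewrite, which is the case the PEP's default was meant to
keep simple.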

> However, many Japanese set the default encoding other than ASCII (we
> use multi-byte encodings for daily use, not as luxury), and some
> Japanese set it, say, "utf-16".

I cannot believe this statement. Much of the standard library will
break if you set the default encoding to utf-16; any sensible setting
of the default encoding sets it to an ASCII superset (in the sense
"ASCII strings have the same bytes under that encoding"). Anybody
setting the default encoding to utf-16 has much bigger problems than
source encodings.
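
The "ASCII superset" property can be checked with a short Python 3
sketch. The helper name is_ascii_superset is hypothetical, and it only
compares one sample string, so it demonstrates the idea rather than
proving anything about a codec as a whole.

```python
def is_ascii_superset(encoding, sample="import sys"):
    """Return True if the ASCII sample has the same bytes under `encoding`.

    Hypothetical helper; it checks one sample string, not the whole codec.
    """
    return sample.encode(encoding) == sample.encode("ascii")


# Print the result of the check for a few encodings named in this thread.
for name in ("ascii", "utf-8", "euc-jp", "shift_jis", "utf-16"):
    print(name, is_ascii_superset(name))
```

Of the encodings listed, only utf-16 fails the check: it adds a byte
order mark and uses two bytes per character even for ASCII text.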

My personal view is that the default encoding should be left at
"ascii" in all cases, and that explicit code set conversions should be
used in source code throughout.
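
As a rough illustration of what explicit conversion looks like, here
is a sketch in present-day Python 3 syntax (not the spelling available
when this message was written). The sample data and variable names are
made up for the example.

```python
# Bytes as they might arrive from an EUC-JP encoded file (sample data).
raw = "\u65e5\u672c\u8a9e".encode("euc-jp")

# Decode explicitly instead of relying on a process-wide default encoding.
text = raw.decode("euc-jp")

# Encode explicitly when handing the text to a Shift_JIS consumer.
out = text.encode("shift_jis")
```

Each conversion names its encoding, so the code behaves the same
whatever the default encoding is set to.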

>    By the PEP as is, persons who use "utf-16" etc. will not be able
> to use many Python scripts any more.  Certainly you can tell them
> not to use "utf-16" as the default encoding.  

It would be good advice to tell them so. However, it would be even
better to tell them that they need to declare the source encoding on
each file they produce.

> But some of them have been writing their scripts in ASCII just as
> specified in the Language Reference, just omitting the encoding
> specification from their scripts to handle their Unicode documents
> easily.  Thus it would be safe to say that it is simply unfair.

There is nothing wrong with writing scripts in ASCII. In phase 1 of
the implementation, you will get away with that if you don't use
Unicode literals. In phase 2, you either need to declare the source
encoding on all files, or change the system encoding. Doing the latter
is better, since setting the default encoding to "utf-16" just won't
work in practice.

>    I would propose that Python should default to ASCII as
> standard encoding if no other encoding hints are given, as the
> bottom line.  The interpreter's default encoding should not be
> referred for source code.

The first version of the PEP said so (it actually said that Latin-1 is
the default encoding); then it was changed at the request of Japanese
users.

>    And I hope that Python defaults to UTF-8 as standard encoding
> if no other encoding hints are given.  

Isn't that contradictory to what you just said?

> It is ASCII-compatible perfectly and language-neutral.  If you once
> commit yourself to Unicode, I think, UTF-8 is an obvious choice
> anyway.

I certainly agree. Under the PEP, you can put the UTF-8 signature (BOM
encoded as UTF-8) in all files (or ask your text editor to do that for
you), and you won't need any additional encoding declaration. Windows
Notepad does that already if you ask it to save files as UTF-8, and
I'd assume other editors will offer that feature as well.
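
For readers unfamiliar with the signature, a small Python 3 sketch of
what it is and how it can be detected follows. The helper name
has_utf8_signature is hypothetical; codecs.BOM_UTF8 and the
"utf-8-sig" codec are part of the standard library. This shows the
byte-level idea only, not how the interpreter implements the PEP.

```python
import codecs


def has_utf8_signature(data):
    """Return True if the bytes start with the UTF-8 encoded BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Sample file contents as an editor might save them with a signature.
source = codecs.BOM_UTF8 + 's = "\u65e5\u672c\u8a9e"\n'.encode("utf-8")

# The "utf-8-sig" codec drops the signature while decoding.
text = source.decode("utf-8-sig")
```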

In any case, under the PEP the source encoding is the user's choice.
The option of making UTF-8 the standard encoding for all source files
was explicitly considered and rejected.

>    From my experiences, inserting the '-*- coding: <coding name>
> -*-' line into an existing file and converting such a file into
> UTF-8 are almost the same amount of work.  We will be glad if
> Python understands Japanese (and other) characters by default
> (by adopting, say, UTF-8 as default).

There is no need to adopt anything as the default to understand
Japanese.

Regards,
Martin