[Python-Dev] PEP 263 considered faulty (for some Japanese)

Martin v. Loewis martin@v.loewis.de
16 Mar 2002 10:54:02 +0100


"SUZUKI Hisao" <suzuki@acm.org> writes:

> But I wonder about the codecs for the various encodings of the
> various countries.  Each of Mainland China, Taiwan, Korea, and Japan
> has its own encoding(s).  They will each have their own large
> table(s) of very many characters.  Will this not make the
> interpreter huge?

Depends on what you count as "the interpreter". Python will use the
codec framework for implementing PEP 263. This supports "pluggable"
codecs, which don't consume any memory until they are used; they do
consume disk space if installed.
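
To illustrate the "pluggable" part, here is a rough sketch against
the codecs module as it exists today (assuming some Japanese codec
package, e.g. JapaneseCodecs, is installed; the names are only
examples):

    import codecs

    # Nothing related to EUC-JP is in memory at this point; the
    # codec registry imports the codec module only on first lookup.
    encode, decode, streamreader, streamwriter = codecs.lookup('euc-jp')

    # Only now is this one codec's mapping table loaded; tables for
    # all the other installed encodings stay on disk.
    data, length = encode(u'hello')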

Even once Python applications primarily use Unicode internally, I
assume there will be a need for these legacy codecs for a long time,
so people will continue to have them installed. So allowing those
encodings for Python source does not add a new burden.

> Maybe each country's local codecs should be packed into a so-called
> Country Specific Package, which can be optional in the Python
> distribution.  I believe you have considered such a thing already.
> In addition, I see that this problem does not relate to PEP 263
> itself in the strict sense.  The PEP just makes use of codecs which
> happen to be there, only requiring that their names match those used
> by Emacs, doesn't it?

Correct. I think the IANA "preferred MIME name" for the encoding
should be used everywhere; this reduces the need for aliases.
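
For example, with the Emacs-compatible declaration that the PEP
specifies, a Japanese source file would start with (the encoding
name here is only an example):

    #!/usr/bin/python
    # -*- coding: euc-jp -*-

Here "euc-jp" is both the Emacs coding system name and a name the
codec registry understands, so the same spelling works in both.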

Also, I'm in favour of exposing the system codecs (on Linux, Windows,
and the Mac); if that is done, there may be no need to incorporate any
additional codecs in the Python distribution.

> In short:
> 
> If the current PEP recognizes the UTF-8 BOM, why does it not allow
> UTF-16 _with_ a BOM?  The implementation would be quite trivial.
> UTF-16 with a BOM is becoming somewhat popular among casual users in
> Japan.

This is, to some degree, comparing apples and oranges. The UTF-8 "BOM"
is not a byte order mark - it is just the ZERO WIDTH NO-BREAK SPACE
encoded in UTF-8. UTF-8 does not have the notion of "byte order", so
there can't be a "byte-order mark".
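
To make this concrete (using the character literal directly rather
than any named constant):

    sig = u'\ufeff'.encode('utf-8')
    # sig is the same three bytes, 0xEF 0xBB 0xBF, on every platform;
    # it carries no byte-order information and merely serves as a
    # signature saying "this text is UTF-8".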

That said, it would be quite possible to support
UTF-16-with-BOM. However, use of UTF-16 in text files is highly
controversial - people have strong feelings against using a BOM to
denote byte order. Also, people may run into problems if their
editors claim to support UTF-16 but fail to emit the BOM.

The primary reason why this is not supported is different, though: it
would complicate the implementation significantly, at least the
phase 1 implementation. If people contribute a phase 2 implementation
that supports the UTF-16 BOM as a side effect, I would personally
reconsider.

> It is true that many Japanese developers do not use UTF-16 at all
> (and may even be suspicious of anyone who talks about the use of it
> ;-).  However, the rest of us certainly do use UTF-16 sometimes.
> You can edit UTF-16 files with, say, jEdit (www.jedit.org) on many
> platforms, including Unix and Windows.  And in particular, you can
> use TextEdit on Mac.  TextEdit on Mac OS X is the counterpart of
> Notepad and WordPad on Windows.

And TextEdit cannot save as UTF-8?

> UTF-16 is typically 2/3 the size of UTF-8 when many CJK characters
> are used (each of them is 3 bytes in UTF-8 and 2 bytes in UTF-16).

While I see that this is a problem for arbitrary Japanese text, I
doubt you will find the 2/3 ratio for Python source code containing
Japanese text in string literals and comments.
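
To illustrate with a made-up line (the identifier and the comment are
invented; byte counts ignore any BOM):

    # ASCII keywords plus a four-character Japanese comment
    # ("denwa bangou", i.e. telephone number):
    line = u'if denwa_bango is not None:  # \u96fb\u8a71\u756a\u53f7'

    len(line.encode('utf-8'))      # 43 bytes: ASCII stays 1 byte each,
                                   # the 4 CJK characters take 3 each
    len(line.encode('utf-16-be'))  # 70 bytes: every character takes 2

The mostly-ASCII part of a source file doubles in size under UTF-16,
so a file has to be dominated by CJK text before the 2/3 ratio
appears.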

> The implementation would be fairly straightforward.  If the file
> begins with either 0xFE 0xFF or 0xFF 0xFE, it must be UTF-16.

But then what? I'm still somewhat unsure about how to implement phase
2 of the PEP. If UTF-16 support falls out as a side effect of a phase
2 implementation, and if it is useful to some users without causing
harm to others, I'm in favour of this extension.
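
The detection itself is the easy part; a rough sketch of just the
check you describe (nothing more than that) would be:

    def guess_utf16(first_two_bytes):
        # Return a codec name if the file starts with a UTF-16 byte
        # order mark, None otherwise.
        if first_two_bytes == b'\xfe\xff':
            return 'utf-16-be'
        if first_two_bytes == b'\xff\xfe':
            return 'utf-16-le'
        return None

Everything after that check - decoding the rest of the file and
getting it through the tokenizer - is where the phase 2 work lies.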

However, I would hate to promise today that phase 2 will support
UTF-16, only for it to turn out to be unimplementable for some
obscure reason.
For example, the parser currently uses fgets to get the next line of
input. For the ASCII-superset encodings, this does the right thing.
For UTF-16, it will break horribly; this actually relates to the
"universal newline support" PEP.

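To spell out the fgets problem (just an illustration of the byte
stream, not of how phase 2 would solve it): fgets scans for the
single byte 0x0A, but in UTF-16 every character, including the
newline, occupies two bytes.

    data = u'x = 1\ny = 2\n'.encode('utf-16-le')
    # data is the byte sequence
    #   78 00 20 00 3D 00 20 00 31 00 0A 00 79 00 ...
    # A byte-oriented fgets() stops right after the 0x0A byte, so the
    # first "line" ends in the middle of the two-byte newline and the
    # next "line" begins with a stray 0x00 byte.  With UTF-16-BE the
    # zero byte precedes the 0x0A instead.
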
Regards,
Martin