[Python-Dev] PEP 263 considered faulty (for some Japanese)

SUZUKI Hisao suzuki@acm.org
Tue, 19 Mar 2002 22:17:47 JST


> And TextEdit cannot save as UTF-8?

It can.  But doing so suffers from "mojibake" (garbled characters).

> The primary reason why this is not supported is different, though: it
> would complicate the implementation significantly, at least the phase 1
> implementation. If people contribute a phase 2 implementation that
> supports the UTF-16 BOM as a side effect, I would personally
> reconsider.

OK, I will write a sample implementation of the "stage2" as soon
as possible, and put it in the public domain.
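
The rough idea is simple enough to sketch already (only a sketch; the
name to_utf8 and the file name spam.py are placeholders, not part of
any real patch):

	def to_utf8(data):
	    # If the file begins with a UTF-16 BOM (little- or big-endian),
	    # transcode the whole file to UTF-8; otherwise pass it through.
	    if data[:2] in ('\xff\xfe', '\xfe\xff'):
	        return unicode(data, 'utf-16').encode('utf-8')
	    return data

	data = open('spam.py', 'rb').read()
	code = compile(to_utf8(data), 'spam.py', 'exec')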

Anyway, until the stage2 becomes real, you can write Japanese Python
files only in EUC-JP or UTF-8 unless you hack up the interpreter, so
Python will remain unsatisfactory to many Japanese users until the day
of UTF-8.  We should either hurry up or keep waiting.
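
(To illustrate: the trouble with Shift_JIS is that the second byte of
some characters is 0x5C, the very byte of the ASCII backslash, so the
tokenizer mangles such string literals, while EUC-JP and UTF-8 keep
every byte of a multibyte character above 0x7F.)  A toy comparison,
using one kanji, U+8868, as the example:

	sjis = '\x95\x5c'        # Shift_JIS: second byte is 0x5C ('\\')
	euc  = '\xc9\xbd'        # EUC-JP: both bytes are above 0x7F
	utf8 = '\xe8\xa1\xa8'    # UTF-8: all three bytes are above 0x7F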

As for UTF-16 with a BOM, any text outside Unicode literals should be
translated into UTF-8 (not UTF-16).  That is the only logical choice,
since UTF-8 is strictly ASCII-compatible and can represent every
Unicode character naturally.  You would write source code in UTF-16 as
follows:

	s = '<characters>'
	...
	u = unicode(s, 'utf-8')  # not utf-16!
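
After the stage2 translation the interpreter actually sees the UTF-8
form of the same line, which is why 'utf-8' is the codec to name.  For
example (two arbitrary kanji):

	s = '\xe6\x97\xa5\xe6\x9c\xac'   # the same literal, now as UTF-8 bytes
	u = unicode(s, 'utf-8')          # recovers exactly the characters typed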

This suggests to me that the implementation will be somewhat like the
one Stephen J. Turnbull sketches...

N.B. a binary data literal (not character data but, say, image or
audio bytes) should be written with hex escapes, as follows:

	b = '\x89\xAB\xCD\xEF'

The stage2 implementation will translate it into UTF-8 exactly
as follows :-)

	b = '\x89\xAB\xCD\xEF'

Hence there is no problem in translating a UTF-16 file into UTF-8.
(At least, since a UTF-16 Python file is totally impossible for now,
allowing it cannot hurt anyone.)
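
A tiny check makes the point (a toy only, written in plain ASCII):

	line = r"b = '\x89\xAB\xCD\xEF'"     # the source line, as ASCII text
	as_utf16 = line.encode('utf-16')     # what a UTF-16 file would store
	assert unicode(as_utf16, 'utf-16').encode('utf-8') == line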

--
SUZUKI Hisao          >>> def fib(n): return reduce(lambda x, y:
suzuki@acm.org        ... (x,x[0][-1]+x[1]), [()]*n, ((0L,),1L))