[I18n-sig] Changing case
Guido van Rossum
guido@python.org
Tue, 11 Apr 2000 10:55:04 -0400
The story continues...
I tried the following in Python 1.6a2p1 on Windows NT 4.0 in three
interpreters: IDLE, command line, and Pythonwin (win32all-130 using
Python 1.6a2p1). (Since I live in the US, I don't have any way to
input non-ASCII characters; so I use escape sequences for input.)
>>> s = '\351\344' # This is e-egu a-umlaut in Latin-1
>>> u = unicode(s, "latin-1") # This simply yields u"\351\344"
>>> print s
(see table below)
>>> print u
(see table below)
>>>
I got the following results:
                print s            print u
                -------            -------
IDLE:           e-egu a-umlaut     e-egu a-umlaut
command line:   THETA SIGMA        three graphics + n~
Pythonwin:      e-egu a-umlaut     A~ (C) A~ o-with-cross
I tried the same thing on Solaris in IDLE and the command line; IDLE
on Solaris did exactly the same thing as it did on Windows, and the
command line on Solaris did exactly the same thing as Pythonwin (!)
did on Windows.
I tried the same thing with IDLE from Python 1.6a1 and also got the
same results -- from this I conclude that Tcl/Tk 8.2 and 8.3 behave
the same way in this respect.
My theory of why IDLE has the highest success rate: Tcl/Tk 8.2 uses UTF-8
internally, but falls back to Latin-1 when you use non-ASCII
characters that are clearly not UTF-8. Thus, "print u" displays the
correct value because Tkinter converts Unicode to UTF-8, and "print s"
displays the correct value because Tcl/Tk recognizes that it's not
UTF-8 and thus interprets it as Latin-1.
The command line (running in a DOS box) uses a default code page which
bears no relation to Latin-1; the THETA and SIGMA happen to have
codes \351 and \344. The gibberish printed for u is simply what its
UTF-8 encoding ('\303\251\303\244') looks like when interpreted in the
same code page.
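
To make the byte-level explanation concrete, here is a minimal sketch in
modern Python 3 (not the 1.6a2 syntax used in the session above). It
assumes the DOS box uses code page 437, the usual US default; the post
does not name the code page, so "cp437" is an assumption, and
as_seen_by() is a hypothetical helper. Character names are printed
instead of the characters themselves so the output is plain ASCII.

```python
import unicodedata

s = b"\351\344"             # the 8-bit string from the session
u = s.decode("latin-1")     # Python 3 spelling of unicode(s, "latin-1")

def as_seen_by(raw, codepage):
    """Hypothetical helper: the text a display using `codepage` shows for raw bytes."""
    return raw.decode(codepage)

# ASSUMPTION: the DOS box uses code page 437; the post does not say which one.
dos_s = as_seen_by(s, "cp437")
dos_u = as_seen_by(u.encode("utf-8"), "cp437")

for label, text in (("print s", dos_s), ("print u", dos_u)):
    print(label, [unicodedata.name(c) for c in text])
```

Under that assumption, the two bytes of s come out as the Greek capitals,
and the four UTF-8 bytes of u come out as box-drawing and symbol
characters followed by n-tilde.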
Finally, Pythonwin: Scintilla (its text widget) seems to know about
Latin-1 only. The four characters it prints for u are the Latin-1
characters for \303, \251, \303 and \244. This is also true for the
command line on Solaris (using xterm with the default Latin-1
encoding).
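
The same effect can be checked for a Latin-1-only display. This is a
small Python 3 sketch, not what Scintilla or xterm actually run; it only
shows that reading the UTF-8 bytes of u one byte at a time as Latin-1
yields the four characters listed above.

```python
u = "\u00e9\u00e4"                  # e-egu a-umlaut as a Unicode string
utf8_bytes = u.encode("utf-8")      # the bytes that reach the display

# A Latin-1-only display maps every byte to the code point of the same value.
misread = utf8_bytes.decode("latin-1")

print([oct(b) for b in utf8_bytes])
print(["U+%04X" % ord(c) for c in misread])
```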
Note that IDLE doesn't always print Latin-1 characters correctly! I
was just lucky. For example, the string "\303,\251,\303\251" prints
as A~, comma, (C), comma, e-egu. In other words, \303 and \251 by
themselves are interpreted as Latin-1, while taken together they are
interpreted as UTF-8.
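
The theory can be modelled in a few lines of Python 3. This is only a
sketch of the behaviour guessed at above, not Tcl/Tk's actual code;
decode_utf8_or_latin1() is a hypothetical name. It accepts a byte
sequence as UTF-8 where it forms exactly one valid character and
otherwise treats the single byte as Latin-1.

```python
def decode_utf8_or_latin1(raw):
    """Model of the guessed fallback: valid UTF-8 sequences win, stray bytes are Latin-1."""
    out = []
    i = 0
    while i < len(raw):
        # Try the longest possible UTF-8 sequence first (4 bytes down to 1).
        for width in (4, 3, 2, 1):
            chunk = raw[i:i + width]
            try:
                text = chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            if len(chunk) == width and len(text) == 1:
                out.append(text)
                i += width
                break
        else:
            # No valid UTF-8 character starts here: fall back to Latin-1.
            out.append(raw[i:i + 1].decode("latin-1"))
            i += 1
    return "".join(out)

mixed = decode_utf8_or_latin1(b"\303,\251,\303\251")
print(["U+%04X" % ord(c) for c in mixed])
```

With this model, both '\351\344' and its UTF-8 form '\303\251\303\244'
come out as e-egu a-umlaut, while the mixed string gives A~, comma, (C),
comma, e-egu, matching what IDLE showed.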
What would be nice? For stdout, to be able to say *independently*
which encoding 8-bit strings are assumed to be in when printed, and which
encoding should be used for the output stream. And for this to work
in all three environments: IDLE, command line and Pythonwin.
In IDLE, the output stream should be fixed to UTF-8, but a user
working with Latin-1 strings could set the default 8-bit string
encoding for output to be Latin-1. Then, print '\351\344' would be
encoded as UTF-8: '\303\251\303\244', which prints as e-egu a-umlaut;
on the other hand, print '\303\251\303\244' would be interpreted as 4
Latin-1 characters, and print as A~ (C) A~ o-with-cross.
In the command line, on Windows the output encoding should be set to
the default MBCS code page, but the default encoding for 8-bit strings
could be set to something user-specified, e.g. Latin-1.
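
As a rough illustration of the two independent settings, here is a
Python 3 sketch. EncodingAwareWriter is a hypothetical class invented
for this example; it is not an existing Python API. It keeps one
encoding for interpreting 8-bit strings and a separate one for the
output stream, and reproduces the IDLE case described above.

```python
import io

class EncodingAwareWriter:
    """Hypothetical wrapper: separate encodings for 8-bit strings and for the stream."""

    def __init__(self, raw_stream, stream_encoding, string_encoding):
        self.raw_stream = raw_stream            # binary stream that receives the output
        self.stream_encoding = stream_encoding  # what the display expects
        self.string_encoding = string_encoding  # what 8-bit strings are assumed to be

    def write(self, data):
        # 8-bit strings are first interpreted using the user's chosen encoding...
        if isinstance(data, bytes):
            data = data.decode(self.string_encoding)
        # ...and everything is then encoded for the output stream.
        self.raw_stream.write(data.encode(self.stream_encoding))

def render(data):
    """Bytes an IDLE-like setup (UTF-8 stream, Latin-1 strings) would emit."""
    sink = io.BytesIO()
    EncodingAwareWriter(sink, "utf-8", "latin-1").write(data)
    return sink.getvalue()

two_chars = render(b"\351\344")             # sent as the UTF-8 for e-egu a-umlaut
four_chars = render(b"\303\251\303\244")    # sent as four separate Latin-1 characters
```

For the Windows command line, the same wrapper would be constructed with
the console's code page as stream_encoding, leaving string_encoding up
to the user.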
A similar thing should happen for input (and the input and output
should normally be switched together, so that a user entering
e.g. shift-JIS would also get shift-JIS on output).
This is quite independent of the source encoding when reading from a
file. I have some issues with the current approach (which seems to be
"use whatever bytes you read" and thus defaults to Latin-1 if you use
non-ASCII characters in Unicode string literals; otherwise it's
whatever the user wants it to be). Note in particular that a user who
edits her source code in shift-JIS can currently *not* use shift-JIS
in Unicode literals -- she must use something like
unicode(".....","shift-jis") to get a Unicode string containing the
correct Japanese characters encoded in Unicode.
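
The difference between the two readings of the same source bytes can be
shown in Python 3 terms. This is a sketch, not the 1.6 tokenizer: the
sample text is an arbitrary pair of Japanese characters chosen for
illustration, and bytes.decode(..., "shift_jis") plays the role of
unicode(".....","shift-jis").

```python
text = "\u65e5\u672c"                        # sample text, chosen for illustration
source_bytes = text.encode("shift_jis")      # what a shift-JIS editor writes to disk

# "Use whatever bytes you read": each byte becomes one Latin-1 character.
as_latin1 = source_bytes.decode("latin-1")

# Explicit conversion, the modern spelling of unicode(".....", "shift-jis").
as_sjis = source_bytes.decode("shift_jis")

print(len(as_latin1), len(as_sjis), as_sjis == text)
```

Only the explicit conversion recovers the original two characters; the
byte-for-byte reading produces one unrelated Latin-1 character per byte.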
Of course, when entering source code interactively, this should be
tied to the encoding for stdin.
--Guido van Rossum (home page: http://www.python.org/~guido/)