[I18n-sig] Changing case

Guido van Rossum guido@python.org
Tue, 11 Apr 2000 10:55:04 -0400


The story continues...

I tried the following in Python 1.6a2p1 on Windows NT 4.0 in three
interpreters: IDLE, command line, and Pythonwin (win32all-130 using
Python 1.6a2p1).  (Since I live in the US, I don't have any way to
input non-ASCII characters; so I use escape sequences for input.)

>>> s = '\351\344' # This is e-egu a-umlaut in Latin-1
>>> u = unicode(s, "latin-1") # This simply yields u"\351\344"
>>> print s
(see table below)
>>> print u
(see table below)
>>> 

I got the following results:

		print s			print u
		-------			-------

IDLE:		e-egu a-umlaut		e-egu a-umlaut

command line:	THETA SIGMA		three graphics + n~

Pythonwin:	e-egu a-umlaut		A~ (C) A~ o-with-cross


I tried the same thing on Solaris in IDLE and the command line; IDLE
on Solaris did exactly the same thing as it did on Windows, and the
command line on Solaris did exactly the same thing as Pythonwin (!)
did on Windows.

I tried the same thing with IDLE from Python 1.6a1 and also got the
same results -- from this I conclude that Tcl/Tk 8.2 and 8.3 behave
the same way in this respect.

My theory of why IDLE has the highest success rate: Tcl/Tk 8.2 uses UTF-8
internally, but falls back to Latin-1 when you use non-ASCII
characters that are clearly not UTF-8.  Thus, "print u" displays the
correct value because Tkinter converts Unicode to UTF-8, and "print s"
displays the correct value because Tcl/Tk recognizes that it's not
UTF-8 and thus interprets it as Latin-1.

The command line (running in a DOS box) uses a default code page which
bears no relation to Latin-1; the THETA and SIGMA happen to have
codes \351 and \344.  The gibberish printed for u is simply what its
UTF-8 encoding ('\303\251\303\244') looks like when interpreted in the
same code page.

Finally, Pythonwin: Scintilla (its text widget) seems to know about
Latin-1 only.  The four characters it prints for u are the Latin-1
characters for \303, \251, \303 and \244.  This is also true for the
command line on Solaris (using xterm with the default Latin-1
encoding).
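
The command line and Pythonwin rows of the table can be reproduced by
decoding the same bytes under different display encodings.  This is a
minimal sketch in Python 3 (not the 1.6 session above), with bytes
literals standing in for the 8-bit strings.  The helper name shown_as
is made up for illustration, and code page 437 is an assumption about
what the DOS box used; the post only says "a default code page".

```python
# Python 3 sketch; bytes objects stand in for the post's 8-bit strings.
s = b'\351\344'                # e-acute, a-umlaut as Latin-1 bytes
u = s.decode('latin-1')        # like unicode(s, "latin-1") in the post
u_bytes = u.encode('utf-8')    # the bytes "print u" wrote: \303\251\303\244


def shown_as(raw, display_encoding):
    """Hypothetical helper: the characters a display would draw for raw."""
    return raw.decode(display_encoding)


# Assumption: the DOS box uses code page 437.
dos_s = shown_as(s, 'cp437')           # THETA SIGMA
dos_u = shown_as(u_bytes, 'cp437')     # three graphics characters + n-tilde

# Pythonwin, and xterm on Solaris with its default Latin-1 encoding.
latin_s = shown_as(s, 'latin-1')       # the intended two characters
latin_u = shown_as(u_bytes, 'latin-1') # four Latin-1 characters

for name in ('dos_s', 'dos_u', 'latin_s', 'latin_u'):
    # ascii() keeps the output readable on any terminal encoding.
    print(name, ascii(globals()[name]))
```

In each case the bytes sent to the display are the same; only the
encoding assumed by the display differs.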

Note that IDLE doesn't always print Latin-1 characters correctly!  I
was just lucky.  For example, the string "\303,\251,\303\251" prints
as A~, comma, (C), comma, e-egu.  In other words, \303 and \251 by
themselves are interpreted as Latin-1, while taken together they are
interpreted as UTF-8.
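
The behaviour described here (valid UTF-8 sequences are decoded as
UTF-8, stray bytes are read as Latin-1) can be imitated with a codec
error handler.  This is a Python 3 sketch of the theory only; it is
not how Tcl/Tk is implemented, and the names latin1-fallback and
tk_guess are made up for the example.

```python
import codecs


def _latin1_fallback(exc):
    """Decode the bytes that are not valid UTF-8 as Latin-1, then resume."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return bad.decode('latin-1'), exc.end


# Register the handler under a hypothetical name.
codecs.register_error('latin1-fallback', _latin1_fallback)


def tk_guess(raw):
    """Sketch of the guessed rule: UTF-8 where valid, Latin-1 otherwise."""
    return raw.decode('utf-8', errors='latin1-fallback')


lucky = tk_guess(b'\351\344')            # not valid UTF-8: read as Latin-1
mixed = tk_guess(b'\303,\251,\303\251')  # lone bytes Latin-1, the pair UTF-8
print(ascii(lucky), ascii(mixed))
```

The second call shows the ambiguity: \303 and \251 on their own come
out as two Latin-1 characters, while the adjacent pair comes out as a
single e-egu.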

What would be nice?  For stdout, to be able to say *independently*
what encoding 8-bit strings are to be assumed when printed, and what
encoding should be used for the output stream.  And for this to work
in all three environments: IDLE, command line and Pythonwin.

In IDLE, the output stream should be fixed to UTF-8, but a user
working with Latin-1 strings could set the default 8-bit string
encoding for output to be Latin-1.  Then, print '\351\344' would be
encoded as UTF-8: '\303\251\303\244', which prints as e-egu a-umlaut;
on the other hand, print '\303\251\303\244' would be interpreted as 4
Latin-1 characters, and print as A~ (C) A~ o-with-cross.

In the command line, on Windows the output encoding should be set to
the default MBCS code page, but the default encoding for 8-bit strings
could be set to something user-specified, e.g. Latin-1.
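
The two independent settings could look roughly like this.  It is a
Python 3 sketch of the idea, not an existing API: the function render
and its parameter names are made up, and cp437 is only an assumed
stand-in for "the default MBCS code page" of a DOS box.

```python
def render(data, string_encoding, stream_encoding):
    """Hypothetical helper: the bytes to write to the output stream.

    string_encoding is what 8-bit strings are assumed to be in;
    stream_encoding is what the output stream expects.
    """
    if isinstance(data, bytes):
        text = data.decode(string_encoding)   # 8-bit string: apply the assumption
    else:
        text = data                           # Unicode string: nothing to assume
    return text.encode(stream_encoding)


# IDLE: stream fixed to UTF-8, user says 8-bit strings are Latin-1.
idle_two = render(b'\351\344', 'latin-1', 'utf-8')
idle_four = render(b'\303\251\303\244', 'latin-1', 'utf-8')

# Windows command line: same string assumption, code page output (assumed cp437).
dos_two = render(b'\351\344', 'latin-1', 'cp437')

print(idle_two, idle_four, dos_two)
```

With these settings the first call produces '\303\251\303\244', the
second produces the UTF-8 for four separate Latin-1 characters, and
the third produces the two code page bytes that display as e-egu
a-umlaut in the DOS box.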

A similar thing should happen for input (and the input and output
should normally be switched together, so that a user entering
e.g. shift-JIS would also get shift-JIS on output).


This is quite independent of the source encoding when reading from a
file.  I have some issues with the current approach, which seems to be
"use whatever bytes you read": it defaults to Latin-1 if you use
non-ASCII characters in Unicode string literals; otherwise it's
whatever the user wants it to be.  Note in particular that a user who
edits her source code in shift-JIS can currently *not* use shift-JIS
in Unicode literals -- she must use something like
unicode(".....","shift-jis") to get a Unicode string containing the
correct Japanese characters encoded in Unicode.
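
The difference between the two readings of the same source bytes can
be shown directly.  This is a Python 3 sketch, so bytes.decode plays
the role of the unicode(".....","shift-jis") call; the four sample
bytes are my own example (two kanji as a shift-JIS editor would save
them), not taken from any real source file.

```python
# Four bytes as a shift-JIS editor would write two kanji to disk.
raw = b'\x93\xfa\x96\x7b'

# "Use whatever bytes you read", treated as Latin-1: one character per byte.
as_latin1 = raw.decode('latin-1')

# Explicit conversion, the workaround described above.
as_sjis = raw.decode('shift_jis')

print(len(as_latin1), ascii(as_latin1))
print(len(as_sjis), ascii(as_sjis))
```

The Latin-1 reading yields four unrelated characters, while the
explicit conversion yields the two intended ones.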

Of course, when entering source code interactively, this should be
tied to the encoding for stdin.

--Guido van Rossum (home page: http://www.python.org/~guido/)