[I18n-sig] Some input Asian characters not displayed in a Python application in Windows 95

Mon Apr 26 14:58:17 EDT 2004

This concerns the display of simplified Chinese characters in the edit
control (GtkEntry widget) of a Unicode (UTF-8) Python application.  I'm
using Python 2.3.3, the GTK+ 2.2.4-2 Runtime Environment, and PyGTK 2.0.0
for Python 2.3.3.  In English (United States) Windows XP Home Edition, all
of the simplified Chinese characters I wanted displayed were displayed;
but in one pair of experiments in Windows 95, 18 plus or minus 11 percent
of the simplified characters were displayed as squares instead of Chinese
characters in the Python application.

I have been trying to get a Unicode Python application written by someone
other than myself to accept and display simplified Chinese characters.  I
made some changes and got it to work in Windows XP Home Edition using
Windows XP's Language Bar for inputting simplified Chinese characters
directly into the Python application.  In Windows 95 I had simplified
Chinese language support installed along with Internet Explorer 5.5
service pack 2 (IE5.5SP2) and what is titled “Outlook Express 5” (which I
will call OE5.5 here, since its “Help,” “About Outlook Express” reports
version 5.50.4807.1700).  I used Microsoft's Global Input Method Editor
(IME) 5.02 for simplified Chinese to input simplified Chinese characters
via pinyin into IE5.5SP2 and OE5.5.  I obtained a TrueType SimSun font in
one of the files simsun.ttf or simsun.ttc, downloaded by clicking on
“Installing Chinese Fonts” at
http://www.langlab.jhu.edu/chineseFonts.html/.  I copied the Chinese
characters from each of those places to the computer clipboard and then
attempted to paste them into the Unicode Python application.  For what I
mainly discuss here I used UTF-8 Unicode encodings throughout these
processes.  I also learned that in English Windows 95 matching the font
in the source and the Python application could be important.  So in
separate experiments I matched the MS Song font in the IE5.5SP2, OE5.5,
or BabelMap source (BabelMap is obtainable from
http://uk.geocities.com/BabelStone1357/Software/BabelMap.html) and the
Python application, and later the SimSun font in the IE5.5SP2 or OE5.5
source and the Python application.  But the result, at least in general,
didn't seem to depend on the UTF-8 or GB2312 source encodings or the
choices of MS Song or SimSun fonts in the source or the choice of MS
Song, SimSun, or Arial font in the Python application.  Choosing full- or
half-width displays for the Chinese characters didn't make any difference
either.  Namely the typical result was that for the simplified Chinese
characters corresponding to “han4 zi4” (pinyin for “Chinese character”),
a square followed by the simplified Chinese character for “zi4” was
displayed.  That is the character for “han4” was not displayed; but the
one for “zi4” was displayed.  On the other hand, in English Windows 95,
typically using UTF-8 encoding, I could display both “han4” and “zi4” in
1) IE5.5SP2 using a variety of font choices, 2) OE5.5 with MS Song font,
3) the Unicode text editor SC UniPad (obtained from
http://www.unipad.org/download/) using an unspecified font, 4) the
Chinese input method editor Ding Dang Write using a Song font, and 5)
BabelMap with the MS Song font.

There is another curious thing:  In OE5.5 with UTF-8 encoding I could
input the simplified Chinese characters corresponding to the pinyin:

Ni3 de Han4 zi4 hen3 hao3 kan4, ye3 bu4 cuo4.  Ni3 zuo2 tian1 qu4 guo
xue2 xiao4 ma?

and have all of them displayed correctly using the Song font.  But
with that text highlighted, when I switched to the SimSun font, the
characters for “han4,” “cuo4,” and “ma” were replaced by blanks while the
rest of the characters remained displayed.  The characters replaced by
blanks in OE5.5 were the same ones replaced by squares in much earlier
tests in the Python application, although the font choices in those
earlier tests are not 100-percent clear from my notes.

Now here are the results of a controlled experiment:  In Windows 95 with
the UTF-8 encoding and the MS Song font I copied the simplified
characters for “han4 zi4” to the computer clipboard.  I could then paste
them into a blank document of the Unicode text editor SC UniPad and have
both displayed correctly.  But when I pasted them into the Python
application with the UTF-8 encoding and MS Song font settings, “han4” was
once again displayed as a square and “zi4” was displayed correctly.  My
conclusion from this experiment
was that there is a problem somewhere in the combination of the Python
application with GTK+ and PyGTK within a Windows-95 operating system.

The hexadecimal codepoints for these two characters are as follows:

                            Hexadecimal codepoints in the encodings
Pinyin for the character    Unicode (UTF-16)    GB2312
han4                        6C49                BABA
zi4                         5B57                D7D6
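
The table values can be double-checked with a few lines of Python.  This
is only a minimal sketch, written in present-day Python 3 syntax rather
than the Python 2.3.3 discussed here, and it assumes that “han4” and “zi4”
are the characters U+6C49 and U+5B57 listed in the table.  The helper name
describe() is made up for illustration.

```python
# Characters from the table, written as escapes so that this file does not
# depend on any particular source encoding.
han4 = "\u6c49"
zi4 = "\u5b57"


def describe(ch):
    """Return (Unicode codepoint, GB2312 bytes, UTF-8 bytes) as hex strings."""
    return (
        "%04X" % ord(ch),
        ch.encode("gb2312").hex().upper(),
        ch.encode("utf-8").hex().upper(),
    )


for name, ch in (("han4", han4), ("zi4", zi4)):
    print(name, describe(ch))
```

Both characters encode cleanly in GB2312 and in UTF-8, which suggests that
the encodings themselves are not where the two characters differ.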

Both of these Unicode codepoints, and in fact the codepoints for all of
the 17 characters I have discussed in this e-mail letter, lie in the Basic
Multilingual Plane (plane 0) of Unicode.  So this is not a problem of
Windows 95 not being able to display characters from Unicode
supplementary plane 2.  And, unless I have missed some place important
beyond labeling or messaging in the Python application's code of which I
am unaware, in some experiments I have seen the effect of han4 missing
when I have not changed the font in transferring the
characters from IE5.5SP2 or OE5.5 to the Python application.  (I had read
that in some cases if one did change the font, Windows 95 could try
“looking” for the character encoding on the system code page, in my case
Windows, code-page 1252; if that happened, the codepoint would not be
found by Windows 95 and a square could be displayed instead of the
Chinese character.  But by matching the fonts in the source and the
Python application, assuming I didn't miss any places important beyond
labeling or messaging in the Python code, this should be ruled out.)   
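
Whether a piece of text stays inside the Basic Multilingual Plane can also
be tested mechanically.  The following is a sketch in Python 3 syntax; the
function names are hypothetical, and U+20000 is used only as a contrasting
example of a plane-2 character, not as one of the characters discussed
here.

```python
def in_bmp(text):
    """True if every character of text lies in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def needs_surrogates(text):
    """True if UTF-16 needs more than one 16-bit unit for some character."""
    return len(text.encode("utf-16-le")) != 2 * len(text)


sample = "\u6c49\u5b57"   # han4 zi4
outside = "\U00020000"    # a plane-2 character, for contrast only

print(in_bmp(sample), needs_surrogates(sample))
print(in_bmp(outside), needs_surrogates(outside))
```

A string for which in_bmp() is true never needs surrogate pairs in UTF-16,
so a square shown for such a character has to come from somewhere else,
for example from the font actually chosen for drawing it.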

What do you think is the explanation for the square displayed instead of
the simplified character for “han4”, yet a correct display for the
simplified character “zi4”?  And how can I fix this?  The fact that all
of the problem characters lie in the Basic Multilingual Plane (BMP),
which Windows 95 is supposed to handle, plus the fact that a Web page,
IE5.5SP2, SC UniPad, Ding Dang Write, and BabelMap all DO correctly
display these characters in Windows 95, convinces me that both “han4” and
“zi4” should be displayable in the Python application in Windows 95, once
the Python, GTK+, PyGTK combination has the correct coding.

I have some guesses: 1) I wondered if the Python application and perhaps
OE5.5 5.50.4807.1700 could be using an older version of Unicode than
IE5.5SP2, in which I didn't find any problems displaying the characters
once things were properly set up.  From the Internet I learned that IE5.5
uses part of Unicode version 3.0 and that at
http://mail.python.org/pipermail/python-dev/2002-July/026576.html on July
15, 2002 Marc-André Lemburg wrote that the Unicode database in Python,
apparently the version of it in use at that time, was created from
Unicode 3.0.  I couldn't find the version of Unicode used in OE5.5; it
came as a part of the download of IE5.5SP2.  And both “han4” and “zi4”
entered Unicode in version 1.1.  So I haven't found any evidence to
support my hypothesis that the problem I found in the Python application
could be due to a version of Unicode in Python being older than in
IE5.5SP2.  2) Perhaps the Unicode part of the Python application assumes
that Unicode is handled internally by the Windows operating system, which
would be consistent with good operation in Windows XP and problems in
Windows 95.  An important, Unicode-related statement in the Python
application has:

gettext.install(self.config.app_name,  locales_dir, unicode=1)

in it.  Or perhaps the Unicode part of the Python application is not
handled correctly in some other way for a Windows-95 operating system. 
What are the important points here?  Or are there any sample codes that
handle Unicode well in Python that uses GTK+ for an English, Windows-9x
operating system to which you could refer me?  If there is such a case,
this could rule out a problem within GTK+.

I have written the author of the Python application about the main
challenge I face that I am discussing here.  But so far I haven't
received a response from him.  One of the important changes in making the
Python application work for me in Windows XP was changing the encoding
for the file (file_enc) in which text data would be saved from

file_enc = locale.getlocale()[1]

to

file_enc = "utf_8"

[As I recall, in Windows XP locale.getlocale()[1] and
getpreferredencoding() both worked well for giving me the encoding of the
system code page of the operating system, in my case English (United
States) or Windows code-page 1252; but neither of these functions gave me
the encoding in use at the moment I was inputting simplified Chinese
characters using an input method.  Setting file_enc = "mbcs", standing
for multibyte character set, didn't let me successfully input simplified
Chinese into the Unicode (UTF-8) Python application either.  I assume
that, except for the fact that in Windows 95 the system code page is an
American-National-Standards-Institute (ANSI) code page, these results
would be the same in Windows 95 as well.]  At least the change to "utf_8"
worked well in Windows XP when inputting simplified Chinese text using
the input method in Windows XP's Language Bar.
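
Why the locale-derived encoding fails while "utf_8" works can be shown
with a short sketch.  It is written in Python 3 syntax, and it assumes the
system code page is cp1252 as on an English (United States) Windows
system; the helper can_encode() and the temporary file name are made up
for illustration.

```python
import locale
import os
import tempfile

text = "\u6c49\u5b57"  # han4 zi4


def can_encode(s, encoding):
    """Return True if s can be written out in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The system default, e.g. cp1252 on English (United States) Windows.
print("preferred encoding:", locale.getpreferredencoding())
print("cp1252 can hold the text:", can_encode(text, "cp1252"))
print("utf_8 can hold the text:", can_encode(text, "utf_8"))

# Round trip through a file with the encoding fixed to utf_8.
file_enc = "utf_8"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding=file_enc) as f:
    f.write(text)
with open(path, "r", encoding=file_enc) as f:
    round_trip = f.read()
print("round trip preserved the text:", round_trip == text)
```

Code page 1252 has no slots for Chinese characters, so any file encoding
taken from the system locale on such a machine cannot store them, whatever
input method is active at the time.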

For the Windows-95 computer I obtained the SimSun font from the Internet;
the MS Song and MS Hei fonts were obtained as part of installing
simplified Chinese language support as part of IE5.5SP2.  All three of
these fonts are Unicode, TrueType fonts from what I understand from my
reading on the Internet.

I have not tried adding the Microsoft Layer for Unicode to the Python
application and don't know how for certain to install the critical
unicoWS.dll file for use by the Python application or, if necessary,
Python itself.  But from what I have read on the Internet it appears that
doing this may not increase the multilingual capabilities of Windows 95. 
Do you think using unicoWS.dll is important to solve the problem I face?

There is one other problem that I have only been able to work around thus
far.  In a Python interpreter window, for

>>> locale.setlocale(locale.LC_ALL, 'ar-ma')

where 'ar-ma' is a short expression for the “Arabic (Morocco)” locale,
a line containing something like cp1256 for code page 1256 and “Saudi
Arabia” was displayed on the computer screen, all of which I consider
acceptable, as far as the Arabic language is concerned.  But for

>>> locale.setlocale(locale.LC_ALL, 'zh-cn')

I obtained “unsupported locale setting” when I should have obtained
something like “Chinese (P.R.C.)” for “Chinese (People's Republic of
China).”  The line

>>> locale.setlocale(locale.LC_ALL, 'chinese-s')

should also have given something like “Chinese (P.R.C.),” but instead
gave something like Chinese and Taiwan and/or code page 950.  Do you know
any way to set the locale in Python for Chinese (P.R.C.), even, if
necessary, somehow using the hexadecimal value 0x0804, which I think is
the Windows identifier for the Chinese (P.R.C.) locale?  I have been
working around this problem in a Python program via:

locale.setlocale(locale.LC_ALL, '')  # two apostrophes (an empty string),
                                     # not one double quotation mark, as
                                     # it may appear on a computer screen
.
.
.
file_enc = "utf_8"

while inputting and saving simplified Chinese characters.
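
One way to probe which locale names a given system accepts is to try a
list of candidates in turn.  The following is only a sketch in Python 3
syntax.  The candidate names are assumptions: 'chs' and
'chinese-simplified' are, to the best of my knowledge, language strings
understood by Microsoft's C runtime, and the zh_CN forms are the usual
names on Unix-like systems, but none of them is guaranteed to be accepted
on a particular machine.  The function name is hypothetical.

```python
import locale

# Candidate names for Chinese (P.R.C.).  These are assumptions to try,
# not a guaranteed list; which ones work depends on the platform.
CANDIDATES = ["chs", "chinese-simplified", "zh_CN.UTF-8", "zh_CN"]


def first_supported_locale(names, category=locale.LC_ALL):
    """Return the first name setlocale() accepts, or None.

    The locale in effect before the call is restored afterwards.
    """
    saved = locale.setlocale(category)  # query the current setting
    try:
        for name in names:
            try:
                locale.setlocale(category, name)
                return name
            except locale.Error:
                continue  # "unsupported locale setting": try the next one
        return None
    finally:
        locale.setlocale(category, saved)


print(first_supported_locale(CANDIDATES))
```

If this prints None, no candidate was accepted, and the workaround of
calling locale.setlocale(locale.LC_ALL, '') together with a fixed
file_enc = "utf_8" remains the fallback.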

As background for anyone writing to me:  I have had a number of years of
programming experience, but not in Python.  I have never written a full
program in Python.  Also I have tried to learn or figure out exactly how
Windows 95 deals with a Unicode codepoint for a Chinese character, the
code page for GB2312 or cp936, a font for the character, and a GB2312
encoding for the character.  Much of my learning on the new subjects has
been to look for or read things on the Internet and to try things or
conduct experiments on computers and draw conclusions from the results,
as you have seen from this e-mail letter.

Most importantly I would appreciate reading experts' opinions on what
could be causing “han4” to not be displayed in the Python application,
which might be related to why it also wasn't displayed in OE5.5 with the
SimSun font, and what you think should be done to fix what I am convinced
is, at least in principle, a fixable problem in the Python application
plus GTK+ plus PyGTK in a Windows-95 computing environment.  What I want
is to have instructions I can give to potential Windows-95, 98, Me, 2000,
and XP users of the Python application for inputting simplified Chinese
characters into it and saving files with those Chinese data in them.  The
Windows-XP solution that I have will probably work with Windows 2000,
aside from perhaps a few changes in how the input method is set up and
perhaps works differently in Windows 2000 compared to in Windows XP.  I
expect a good solution in Windows 95 would probably also work in a
similar fashion in Windows 98 and Me.  Thanks in advance for any help you
give me.
