UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux & Windows XP

Mon Feb 21 20:30:25 EST 2005

   A very very basic UTF-8 question that's driving me nuts:

If I have this in the beginning of my Python script in Linux:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

should I - or should I not - be able to use non-ASCII characters 
in strings and in Tk GUI button labels and GUI window titles and in 
raw_input data without Python returning the wrong case in manipulated 
strings and/or garbled characters in the Tk GUI title? 

By non-ASCII characters I mean (ISO-8859-1 ??) stuff like the 
German / Swedish / Finnish / etc. "umlauted" letter A (= A with a 
diaeresis; that is, an 'A' with two dots above it, or an O with two 
dots above).

In Linux in the Tk(?) GUI of my 'program' I get an uppercase "A" 
with a tilde above - followed by a general currency symbol ['spider']. 
That is, two wrong characters where a small umlauted letter "a" 
should be. 
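
A minimal sketch of how those two wrong characters can come about, assuming 
(this is a guess, not something confirmed above) that the UTF-8 bytes of the 
umlauted "a" get read as ISO-8859-1 somewhere on the way to the GUI. It is 
written in Python 3 syntax rather than the Python 2.3 used here, and the 
variable names are only illustrative:

```python
# Python 3 sketch; names are illustrative only.
text = "\u00e4"                       # small a with diaeresis
utf8_bytes = text.encode("utf-8")     # two bytes for the one letter

# Assumption: something reads those two bytes as ISO-8859-1,
# where every byte is a character of its own.
misread = utf8_bytes.decode("iso-8859-1")

print(len(utf8_bytes))                # number of bytes in the UTF-8 form
print(misread == "\u00c3\u00a4")      # A-tilde followed by the currency sign?
```

If the comparison holds, the two-character garbage is exactly one UTF-8 
letter misread byte by byte.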

But in Windows XP exactly the *same* code (the initial 
"#!/usr/bin/env python" line and all) works just fine in the Tk GUI, with 
non-ASCII characters showing just as they should. (The code in both cases 
is without any u' prefixes in strings.)

I have UTF-8 set as the encoding of my Suse 9.2 / KDE localization, I 
have saved my 'source code' in UTF-8 format and I have tried to read
*a lot* of information about Unicode and I have heard it said many 
times that Python handles Unicode very well -- so why can it be so 
bl**dy difficult to get an umlauted (two-dotted) letter a to be 
properly handled by Python 2.3? In Windows I have Python 2.4 - but the 
following case-insanity applies for Windows-Python as well:

For example, if I do this in my Linux konsole (no difference whether it 
be in KDE Konsole window or the non-gui one via CTRL-ALT-F2):

>>> aoumlautxyz = "12xyz"     # number 1 = umlauted a, number 2 = uml o 
>>> print aoumlautxyz.upper()

then the resulting string is NOT all upper case - it is a lowercase 
umlauted a, then a lowercase umlauted o then uppercase XYZ

And command:

>>> print aoumlautxyz.title()

..results in a string where a-umlaut, o-umlaut and yz are lowercase and 
only the X in the middle is uppercase.  

This:

>>> print aoumlautxyz.lower()

.. prints o.k. 
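
A rough sketch of the behaviour described above, under the assumption that a 
plain (non-u'') Python 2 string holding UTF-8 behaves like a Python 3 bytes 
object: its case methods only touch the ASCII letters, so the two bytes of 
each umlauted letter pass through unchanged. The code is Python 3 syntax and 
the names are illustrative:

```python
# Python 3 sketch; bytes stand in for a Python 2 plain string (assumption).
raw = "\u00e4\u00f6xyz".encode("utf-8")     # a-umlaut, o-umlaut, then xyz

# Byte-wise case methods only know the ASCII letters.
bytes_upper = raw.upper().decode("utf-8")   # umlauts stay lowercase
bytes_title = raw.title().decode("utf-8")   # first ASCII letter gets the capital
bytes_lower = raw.lower().decode("utf-8")   # nothing to change, looks fine

# Decoding to text first lets the case methods see whole characters.
text = raw.decode("utf-8")
text_upper = text.upper()
text_title = text.title()

print(bytes_upper, bytes_title, bytes_lower)
print(text_upper, text_title)
```

The second print line shows what the same methods do once the data is text 
rather than encoded bytes.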

Am I missing something very basic here? Earlier there was a difference in 
my results between running the scripts in the CTRL ALT F2-konsole and the 
KDE-one, but I think running unicode_start & installing a Unicode console
font at some point ironed that one out.

If this is due to some strange library, could someone please give me a 
push to a spot where to read about fixing it? Or am I just too stupid, 
and that's it. (I bet that really is what it boils down to..)

<rant>

I cannot be the only (non-pro) person in Europe who might need to use non-
ASCII characters in GUI titles / button labels, in strings provided by the 
users of the software with raw_input (like person's name that begins with 
an umlauted letter or includes one or several of them) ..in comments, and 
so on.

How would you go about making a script where a) the user types in any text 
(that might or might not include umlauted characters) and b) that text then 
gets uppercased, lowercased or "titled" and c) printed? 

Isn't it enough to have that 'UTF-8 encoding declaration' in the beginning,
and then just like get the user's raw_input, mangle it about with .title() 
or some such tool, and then just spit it out with a print statement?
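
One possible shape for such a script, sketched in Python 3 syntax: decode 
the typed bytes once, do the case changes on text, and encode again only for 
output. The helper name recase, the default encoding and the sample name are 
all made up for illustration; the right encoding depends on the terminal and 
is an assumption here:

```python
# Python 3 sketch; recase() and the sample input are hypothetical.
def recase(raw_bytes, encoding="utf-8"):
    """Decode typed input, change case on text, and re-encode for output."""
    text = raw_bytes.decode(encoding)
    variants = (text.upper(), text.lower(), text.title())
    return tuple(v.encode(encoding) for v in variants)

# Stand-in for what the user typed, as UTF-8 bytes (assumed terminal encoding).
typed = "\u00e5sa \u00f6berg".encode("utf-8")

upper, lower, title = recase(typed)
print(upper.decode("utf-8"))
print(lower.decode("utf-8"))
print(title.decode("utf-8"))
```

The point of the design is that the encoding is dealt with only at the two 
edges (input and output), so the user never types any escape sequences.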

One can hardly expect the users to type things like 
unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8'), u'\xe4\xf6\xfc', 
u"äöü".encode('utf-8') or whatnot, and encode & decode to and fro till 
the cows come home just to get a letter or two in their name to show up 
correctly. 

It's a shame that the Linux Cookbook, Learning Python 2nd ed, Absolute 
beginners guide to Python, Running Linux, Linux in a Nutshell, Suse 9.2 
Pro manuals and the online documentation I have bumped into with Google
(like in unicode.org or python.org or even the Python Programming FAQ 
1.3.9 / Unicode error) do not contain enough - or simple enough - 
information for a Python/Linux newbie to get 'it'.

For what it's worth, in KMail my encoding is ISO-8859-1. I tried that 
encoding in KDE and in my Python scripts earlier, too, but it was 
no better; actually that was why I started this Unicode sh..  ..thing. 

Am I beyond hope?

</rant>

Mike     d