UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux & Windows XP

"Martin v. Löwis" martin at v.loewis.de
Tue Feb 22 01:51:33 EST 2005


Mike Dee wrote:
> If I have this in the beginning of my Python script in Linux:
> 
> #!/usr/bin/env python
> # -*- coding: UTF-8 -*-
> 
> should I - or should I not - be able to use non-ASCII characters 
> in strings and in Tk GUI button labels and GUI window titles and in 
> raw_input data without Python returning wrong case in manipulated 
> strings and/or gibberished characters in Tk GUI title? 

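If you use byte strings, you should expect moji-bake. The coding
declaration only tells Python how to decode the source file. That
matters for Unicode literals, which are converted to characters;
byte string literals just keep the bytes as they appear in the file.
So try putting a "u" in front of each such string.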
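
A small sketch of the difference, assuming a reasonably recent Python
(the variable names are only for illustration, and the literals use
escapes so the result does not depend on the file's encoding):

```python
text = u"\xe4\xf6\xfc"        # Unicode string: a, o, u with diaeresis
raw = text.encode("utf-8")    # the same text as six UTF-8 bytes

upper_text = text.upper()     # case mapping works on characters
upper_raw = raw.upper()       # only ASCII letters would change; none here

print(repr(upper_text))
print(repr(upper_raw))
```

The Unicode string comes back upper-cased; the byte string comes back
unchanged, because none of its bytes is an ASCII letter.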

> With non-ASCII characters I mean ( ISO-8859-1 ??) stuff like the 
> German / Swedish / Finnish / etc "umlauted" letter A  (= a diaeresis; 
> that is an 'A' with two dots above it, or an O with two dots above.)

You explicitly requested that these characters *not* be ISO-8859-1,
by saying that you want them as UTF-8. The LATIN CAPITAL LETTER A WITH
DIAERESIS can be encoded in many different character sets, e.g.
ISO-8859-15, windows-1252, UTF-8, UTF-16, euc-jp, T.101, ...

In different encodings, different byte sequences are used to represent
the same character. If you pass a byte string to Tk, it does not know
which encoding you meant to use (this is known in the Python source,
but lost on the way to Tk). So it guesses ISO-8859-1; this guess is
wrong because it really is UTF-8 in your case.
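
To make this concrete, here is a short sketch (names are illustrative;
it uses the small letter so the wrong result is printable) of one
character in three encodings, and of what a wrong ISO-8859-1 guess
does to UTF-8 bytes:

```python
a_umlaut = u"\xe4"   # LATIN SMALL LETTER A WITH DIAERESIS

as_utf8 = a_umlaut.encode("utf-8")         # two bytes: 0xC3 0xA4
as_latin1 = a_umlaut.encode("iso-8859-1")  # one byte:  0xE4
as_utf16 = a_umlaut.encode("utf-16-le")    # two bytes: 0xE4 0x00

# A library that assumes ISO-8859-1 turns the two UTF-8 bytes
# into two separate characters instead of one.
wrong_guess = as_utf8.decode("iso-8859-1")
right_guess = as_utf8.decode("utf-8")

print(repr(wrong_guess), repr(right_guess))
```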

OTOH, if you use a Unicode string, it is very clear what internal
representation each character has.

> How would you go about making a script where a) the user types in any text 
> (that might or might not include umlauted characters) and b) that text then 
> gets uppercased, lowercased or "titled" and c) printed? 

Use Unicode.
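
Roughly: decode the bytes once when they come in, do all the case
changes on the Unicode string, and encode once when writing out. A
minimal sketch, assuming the input arrives as UTF-8 (in a real script
you would take the encoding from the terminal or locale); recase and
for_output are made-up helper names, not library functions:

```python
def recase(typed_bytes, encoding="utf-8"):
    """Decode once, then return upper, lower and title-cased text."""
    text = typed_bytes.decode(encoding)
    return text.upper(), text.lower(), text.title()


def for_output(text, encoding="utf-8"):
    """Encode once, at the point where the text leaves the program."""
    return text.encode(encoding)


# Stand-in for what the user typed, as the bytes the program receives.
typed = u"j\xfcrgen \xf6sterreicher".encode("utf-8")

upper, lower, titled = recase(typed)
print(repr(for_output(titled)))
```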

> Isn't it enough to have that 'UTF-8 encoding declaration' in the beginning,
> and then just like get the user's raw_input, mangle it about with .title() 
> or some such tool, and then just spit it out with a print statement?

No.

> One can hardly expect the users to type characters like 
> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8'), u'\xe4\xf6\xfc', 
> u"äöü".encode('utf-8') or whatnot, and encode & decode to and fro till 
> the cows come home just to get a letter or two in their name to show 
> up correctly. 

This is not necessary.

> Am I beyond hope?

Perhaps not. You should, however, familiarize yourself with the notion
of character encodings, and how the same character can have different
byte representations, and the same byte representation can have different
interpretations as a character. If libraries disagree on how to
interpret bytes as characters, you get moji-bake (ghost characters;
a Japanese term for the problem, as Japanese users have been familiar
with the problem for a long time).
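
The second half of that, one byte with several readings, can be seen
with a few of the codecs that ship with Python (the choice of these
three is only an example):

```python
one_byte = b"\xe4"

as_western = one_byte.decode("iso-8859-1")   # a with diaeresis
as_greek = one_byte.decode("iso-8859-7")     # Greek small delta
as_cyrillic = one_byte.decode("cp1251")      # Cyrillic small de

print(repr(as_western), repr(as_greek), repr(as_cyrillic))
```

Nothing in the byte itself says which of the three was meant; that
information has to travel with the data, or be agreed on beforehand.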

The Python Unicode type solves these problems for good, but you
need to use it correctly.

Regards,
Martin


