Internationalization bug?? [Python 2.2.1, RedHat 8.0, Swedish]

Martin v. Loewis martin at v.loewis.de
Sat Oct 12 19:00:43 EDT 2002


urban.anjar at hik.se (Urban Anjar) writes:

> >>> S = 'åäö'
> >>> print S
> åäö
> >>> print len(S)
> 6
> Seems like every Swedish character occupies 2 bytes
> and len() returns the number of bytes, not the number of
> characters...

It appears you are using a UTF-8 locale. In UTF-8, each of the accented
Latin characters used in Swedish takes two bytes; many characters (CJK
in particular) even take three bytes.
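
As a rough illustration of those byte counts, the following sketch
encodes a few sample characters as UTF-8 and measures the result. The
sample characters and the helper name utf8_length are my own choices for
illustration, and the code is written with a current Python 3
interpreter in mind rather than 2.2.1.

```python
# Hypothetical helper (not from this thread): number of bytes a single
# character occupies once it is encoded as UTF-8.
def utf8_length(char):
    return len(char.encode("utf-8"))

# Escapes are used so the example does not depend on the encoding of
# the source file.
samples = [
    (u"a", "plain ASCII letter"),
    (u"\xe5", "a with ring above (Swedish)"),
    (u"\u4e2d", "a CJK ideograph"),
]

for char, description in samples:
    print("%s: %d byte(s)" % (description, utf8_length(char)))
```

The three samples should report one, two and three bytes respectively.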

You are somewhat misguided in assuming that each character takes only a
single byte. If that were the case, you could support only 256
characters, but UTF-8 (and Unicode) supports many more characters.

Perhaps you misinterpreted the meaning of the len function: For a byte
string, it gives you the number of bytes, not (necessarily) the number
of characters.

To work with characters, you may want to try Unicode. If you do

s = unicode(S, "utf-8")
print len(s)

you should see that you really have only three characters.
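
The same distinction can be sketched in a form that I would expect to
run on a current Python 3 interpreter, where the byte string and the
Unicode string are separate types. The variable names are mine, and
UTF-8 is assumed as the terminal encoding, as in your session.

```python
# The three Swedish letters, written with escapes so the example does
# not depend on the encoding of the source file.
text = u"\xe5\xe4\xf6"

# Roughly what a UTF-8 terminal hands to the interpreter: a byte string.
raw = text.encode("utf-8")

# Decoding turns the bytes back into characters.
decoded = raw.decode("utf-8")

print(len(raw))      # counts bytes
print(len(decoded))  # counts characters
```

The first len() counts the six bytes of the encoded form; the second
counts the three characters.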

> Of course I can analyze how characters are represented in detail
> and make some kind of workaround, but I think this is not the Python
> way. In assembler or C I have to think of things like that but do I
> have to do that in Python?

If you use byte strings, yes. If you use Unicode strings, you can
reverse the string at the character level.

Of course, to print it on your terminal, you have to convert it back
to the encoding your terminal uses, i.e.

s = rev(s)
print s.encode("utf-8")
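
To make the difference concrete, here is a minimal sketch of reversing
at the character level and encoding only on output. The function
reverse_text is a hypothetical stand-in for rev, which is not defined in
this message; UTF-8 is assumed as the terminal encoding, and the code
targets a current Python 3 interpreter.

```python
def reverse_text(text):
    # Reverse a Unicode string one character at a time.
    return text[::-1]

text = u"\xe5\xe4\xf6"
reversed_text = reverse_text(text)

# Encode only at the boundary, when handing data to the terminal.
# UTF-8 is an assumption; use whatever encoding the terminal expects.
output_bytes = reversed_text.encode("utf-8")

# Reversing the encoded bytes instead splits the two-byte sequences,
# so the result no longer decodes as UTF-8.
broken = text.encode("utf-8")[::-1]
try:
    broken.decode("utf-8")
    bytes_survived = True
except UnicodeDecodeError:
    bytes_survived = False

print(bytes_survived)
```

Reversing the byte string puts a continuation byte first, which is why
the decode step fails; reversing the Unicode string keeps each
character intact.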

Regards,
Martin


