Internationalization bug?? [Python 2.2.1, RedHat 8.0, Swedish]

Sun Oct 13 07:26:08 EDT 2002

If len() returns the number of bytes, what can Urban Anjar use to get the
number of characters?

"Martin v. Loewis" <martin at v.loewis.de> wrote in message
news:m3of9zjm90.fsf at mira.informatik.hu-berlin.de...
> urban.anjar at hik.se (Urban Anjar) writes:
>
> > >>> S = 'åäö'
> > >>> print S
> > åäö
> > >>> print len(S)
> > 6
> > Seems like every Swedish character occupies 2 bytes
> > and len() returns the number of bytes, not the number of
> > characters...
>
> It appears you are using a UTF-8 locale. In UTF-8, accented Latin
> characters such as å, ä and ö take two bytes each; many characters
> (CJK in particular) even take three bytes.
>
> You are somewhat misguided in assuming that each character takes only
> a single byte. If that were the case, you could only support 256
> characters, but UTF-8 (and Unicode) supports many more characters.
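The byte widths mentioned above can be checked directly. This is a minimal sketch in modern Python 3, which is an assumption: the thread itself concerns Python 2.2.1, where the string types differ. The sample characters and the name `widths` are chosen only for illustration.

```python
# Number of bytes each sample character occupies once encoded as UTF-8.
# Escapes are used so the example does not depend on the source encoding.
samples = [
    "a",        # plain ASCII letter
    "\u00e5",   # å, an accented Latin letter
    "\u4e2d",   # 中, a CJK ideograph
]

widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in widths.items():
    print(repr(ch), n)
```

Under these assumptions the ASCII letter should take one byte, the accented letter two, and the CJK ideograph three.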
>
> Perhaps you misinterpreted the meaning of the len function: For a byte
> string, it gives you the number of bytes, not (necessarily) the number
> of characters.
>
> To work with characters, you may want to try Unicode. If you do
>
> s = unicode(S, "utf-8")
> print len(s)
>
> you should see that you really have three characters only.
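For readers on a current interpreter, the same check can be sketched in Python 3, where `bytes` plays the role of the old byte string and `str` the role of the Unicode string. This is an assumed translation, not the code from the thread; the names `raw` and `text` are illustrative.

```python
# The six UTF-8 bytes that a UTF-8 terminal sends for 'åäö'.
raw = b"\xc3\xa5\xc3\xa4\xc3\xb6"

# Decoding turns the byte string into a string of characters.
text = raw.decode("utf-8")

byte_count = len(raw)    # length in bytes
char_count = len(text)   # length in Unicode code points

print(byte_count, char_count)
```

Note that `len()` on a decoded string counts code points, so a letter written as a base character plus a separate combining mark would still count as two.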
>
> > Of course I can analyze how characters are represented in detail
> > and make some kind of workaround, but I think this is not the Python
> > way. In assembler or C I have to think of things like that but do I
> > have to do that in Python?
>
> If you use byte strings, yes. If you use Unicode strings, you can
> reverse the string at the character level.
>
> Of course, to print it on your terminal, you have to convert it back
> to the encoding your terminal uses, i.e.
>
> s = rev(s)
> print s.encode("utf-8")
>
> Regards,
> Martin
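
The difference between reversing bytes and reversing characters can be sketched as follows, again in Python 3 as an assumption. The `rev` function from the thread is not shown there, so slice reversal stands in for it here; all variable names are hypothetical.

```python
# UTF-8 bytes for 'åäö', as in the original session.
raw = b"\xc3\xa5\xc3\xa4\xc3\xb6"

# Character-level reversal: decode first, reverse, then encode again
# into the encoding the terminal expects (UTF-8 is assumed here).
text = raw.decode("utf-8")
reversed_text = text[::-1]
encoded = reversed_text.encode("utf-8")

# Byte-level reversal splits each two-byte sequence, so the result
# is no longer valid UTF-8.
naive = raw[::-1]
try:
    naive.decode("utf-8")
    naive_is_valid = True
except UnicodeDecodeError:
    naive_is_valid = False

print(reversed_text, encoded, naive_is_valid)
```

Decoding at the input boundary and encoding only when writing to the terminal keeps all string manipulation at the character level.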