[Tutor] surprising len() results ???

Tue Aug 9 12:33:11 CEST 2005

On 8/9/05, Tom Cloyd <tomcloyd at bestmindhealth.com> wrote:

> print len('()ÄäÀÁàáÇçÈÉèéÌÍìíÑñÒÓòóÙÚúù')
> print len('--AaAAaaCcEEeeIIiiNnOOooUUuu')
> 
> the result:
> 
> 54
> 28
> 
> I'm completely mystified by this. All of it. None of it makes sense. This
> program was working fine. Now it doesn't. And those two parameter string
> are plainly the same length, but Python doesn't think so.

on my computer (python 2.3):
>>> s = 'ä' # a-umlaut
>>> len(s)
1
>>> len(s.decode('latin-1')) # prepare for utf-8 encoding
1
>>> len(s.decode('latin-1').encode('utf-8'))
2
>>> len('a'.decode('latin-1').encode('utf-8'))
1

It seems that len() returns the number of bytes, and some encodings use
more than one byte for certain chars. You can probably decode your
strings from utf-8 (or whatever encoding you use), and perhaps encode
them back into a one-char-one-byte encoding. On my system the decoded
(unicode) string is just fine.
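
A minimal sketch of that idea, assuming your source file is saved as
utf-8 (I can't see your editor settings, so that part is a guess). That
would also explain the 54: 2 plain ASCII chars plus 26 accented chars at
2 bytes each. The variable names are only for illustration, and the
sample string is shortened and written with escapes so the file's own
encoding doesn't matter:

```python
# Four accented characters (A-umlaut, a-umlaut, N-tilde, n-tilde),
# written as escapes in a unicode string.
text = u'\xc4\xe4\xd1\xf1'

# Encode the same characters two ways.
utf8_bytes = text.encode('utf-8')      # two bytes per accented char
latin1_bytes = text.encode('latin-1')  # one byte per char

char_count = len(text)           # counts characters
utf8_len = len(utf8_bytes)       # counts bytes
latin1_len = len(latin1_bytes)   # counts bytes

# Decoding the utf-8 bytes gives back the original characters.
roundtrip = utf8_bytes.decode('utf-8')

print(char_count)   # 4
print(utf8_len)     # 8
print(latin1_len)   # 4
```

So if you decode both of your strings to unicode before comparing
lengths, they should come out the same.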

regards 
Michael