Conditional for...in failing with utf-8, Spanish book translation
Marc 'BlackJack' Rintsch
bj_666 at gmx.net
Mon Apr 21 02:51:12 EDT 2008
On Mon, 21 Apr 2008 08:33:47 +0200, Hunter wrote:
> I've narrowed the problem down to a simple test program. Check this out:
>
> ---
>
> # -*- coding: utf-8 -*-
>
> acceptable = "abcdefghijklmnopqrstuvwxyzóíñú" # this line will work
> acceptable = "abcdefghijklmnopqrstuvwxyzóíñúá" # this line won't
> #wtf?
>
> word = "¡A"
> word_key = ''.join([c for c in word.lower() if c in acceptable])
> print "word_key = " + word_key
>
> ---
>
> Any ideas? I'm really stumped!
You are not working with unicode objects but with UTF-8 encoded byte
strings. Those are bytes, not letters/characters. Your `word`, for example,
contains three bytes and not the two characters you think it contains:
In [43]: word = "¡A"

In [44]: len(word)
Out[44]: 3

In [45]: for c in word: print repr(c)
   ....:
'\xc2'
'\xa1'
'A'
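So you are *not* testing whether ¡ is in `acceptable`, but whether each of
the two byte values that make up its UTF-8 representation is. The second of
those bytes, '\xa1', is also the second byte of á ('\xc3\xa1'), which is why
adding á to `acceptable` suddenly lets a stray byte through.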
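A minimal sketch of the difference, assuming the text is kept as unicode
objects (u'' literals here) rather than byte strings. The names
`acceptable_bytes`, `word_bytes` and `leaked` are made up for illustration.
It filters the same word once byte by byte and once character by character:

```python
# -*- coding: utf-8 -*-
# Sketch only: contrasts byte-wise and character-wise membership tests.

acceptable = u"abcdefghijklmnopqrstuvwxyzóíñúá"
word = u"¡A"

# Byte-wise view: roughly what the plain "..." literals were doing.
acceptable_bytes = acceptable.encode("utf-8")
word_bytes = word.lower().encode("utf-8")

# Slice one byte at a time so each item stays a byte string.
leaked = b"".join(
    word_bytes[i:i + 1]
    for i in range(len(word_bytes))
    if word_bytes[i:i + 1] in acceptable_bytes
)

# Character-wise view: each item is a whole character, so the
# inverted exclamation mark is tested (and rejected) as one unit.
word_key = u"".join(c for c in word.lower() if c in acceptable)

print(repr(leaked))    # contains the stray byte 0xA1 before the 'a'
print(repr(word_key))  # contains only the letter 'a'
```

If the words come from a file or another byte source, decoding them once on
input (for example with `some_bytes.decode("utf-8")`) and encoding only on
output keeps all the `in` tests character-based.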
Ciao,
Marc 'BlackJack' Rintsch