Conditional for...in failing with utf-8, Spanish book translation

Mon Apr 21 02:51:12 EDT 2008

On Mon, 21 Apr 2008 08:33:47 +0200, Hunter wrote:

> I've narrowed the problem down to a simple test program. Check this out:
> 
> ---
> 
> # -*- coding: utf-8 -*-
> 
> acceptable = "abcdefghijklmnopqrstuvwxyzóíñú" # this line will work
> acceptable = "abcdefghijklmnopqrstuvwxyzóíñúá" # this line won't
> #wtf?
> 
> word = "¡A"
> word_key = ''.join([c for c in word.lower() if c in acceptable])
> print "word_key = " + word_key
> 
> ---
> 
> Any ideas? I'm really stumped!

You are not working with unicode objects but with UTF-8 encoded byte
strings.  Iterating over such a string yields bytes, not letters/characters.
Your `word`, for example, contains three bytes and not the two characters
you think it contains:

In [43]: word = "¡A"

In [44]: len(word)
Out[44]: 3

In [45]: for c in word: print repr(c)
   ....:
'\xc2'
'\xa1'
'A'

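So you are *not* testing whether ¡ is in `acceptable` but whether each of
the two byte values that make up its UTF-8 representation is.  ¡ is
encoded as '\xc2\xa1' and á as '\xc3\xa1', so once á is in `acceptable`
the stray byte '\xa1' passes the test and ends up in `word_key`.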
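
One possible way around it, as a minimal sketch rather than the only fix:
use unicode literals (the `u` prefix) so that iteration and `in` work on
characters instead of bytes.  The names `inverted_bang`, `a_acute` and
`shared` are made up for this illustration, and the snippet assumes the
source file really is saved as UTF-8:

```python
# -*- coding: utf-8 -*-

# Unicode literals: every element is one character, not one byte.
acceptable = u"abcdefghijklmnopqrstuvwxyzóíñúá"
word = u"¡A"

# Same filter as in the test program, now comparing characters.
word_key = u"".join(c for c in word.lower() if c in acceptable)

print(len(word))           # two characters this time
print(word_key == u"a")    # the ¡ is dropped as a whole

# Byte-level view of why adding á changed the result of the byte test:
# both encodings end in the same byte.
inverted_bang = u"¡".encode("utf-8")   # bytes C2 A1
a_acute = u"á".encode("utf-8")         # bytes C3 A1
shared = inverted_bang[-1:] == a_acute[-1:]
print(shared)
```

If the words come from a file or another byte source instead of literals,
they would presumably need an explicit `.decode("utf-8")` before filtering.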

Ciao,
	Marc 'BlackJack' Rintsch