Problem with sets and Unicode strings

Tue Jun 27 16:05:55 EDT 2006

On 6/27/06, Dennis Benzinger <Dennis.Benzinger at gmx.net> wrote:
> Hi!
>
> The following program in a UTF-8 encoded file:
>
>
> # -*- coding: UTF-8 -*-
>
> FIELDS = ("Fächer", )
> FROZEN_FIELDS = frozenset(FIELDS)
> FIELDS_SET = set(FIELDS)
>
> print u"Fächer" in FROZEN_FIELDS
> print u"Fächer" in FIELDS_SET
> print u"Fächer" in FIELDS
>
>
> gives this output
>
>
> False
> False
> Traceback (most recent call last):
>    File "test.py", line 9, in ?
>      print u"Fächer" in FIELDS
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> ordinal not in range(128)
>
>
> Why do the first two print statements succeed and the third one fail
> with an exception?

Actually, all three statements fail to produce the correct result: the
first two print False where True is expected, and the third raises.

> Why does the use of set/frozenset remove the exception?

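Because sets use a hash lookup to find matches: the byte string and the
unicode string hash differently here, so the two are never actually
compared. The last statement compares a unicode string with a byte
string directly. To do that, Python first decodes the byte string with
the default ascii codec, and the byte 0xc3 is not ascii; that's why
Python raises an exception. The problem is very easy to fix: use
unicode strings for all non-ascii strings.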
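
A minimal sketch of that fix, assuming the only change needed is the u
prefix on the literal in FIELDS. The names results and raw are mine,
added for illustration, and the second half assumes that any byte
strings you receive are UTF-8 encoded:

```python
# -*- coding: UTF-8 -*-

# Same program, but the field name is a unicode literal, so both sides
# of every comparison are unicode strings.
FIELDS = (u"Fächer", )
FROZEN_FIELDS = frozenset(FIELDS)
FIELDS_SET = set(FIELDS)

results = [
    u"Fächer" in FROZEN_FIELDS,
    u"Fächer" in FIELDS_SET,
    u"Fächer" in FIELDS,
]
print(results)  # prints [True, True, True]

# If a name arrives as a byte string (assumed UTF-8 here), decode it
# explicitly before the lookup instead of relying on the ascii default.
raw = u"Fächer".encode("utf-8")
found = raw.decode("utf-8") in FIELDS_SET
```

With the literal written this way the set lookups and the tuple lookup
agree, and no implicit ascii decoding is involved.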