Problem with sets and Unicode strings

Tue Jun 27 17:04:21 EDT 2006

Serge Orlov wrote:
> On 6/27/06, Dennis Benzinger <Dennis.Benzinger at gmx.net> wrote:
>> Hi!
>>
>> The following program in a UTF-8 encoded file:
>>
>>
>> # -*- coding: UTF-8 -*-
>>
>> FIELDS = ("Fächer", )
>> FROZEN_FIELDS = frozenset(FIELDS)
>> FIELDS_SET = set(FIELDS)
>>
>> print u"Fächer" in FROZEN_FIELDS
>> print u"Fächer" in FIELDS_SET
>> print u"Fächer" in FIELDS
>>
>>
>> gives this output
>>
>>
>> False
>> False
>> Traceback (most recent call last):
>>    File "test.py", line 9, in ?
>>      print u"Fächer" in FIELDS
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
>> ordinal not in range(128)
>>
>>
>> Why do the first two print statements succeed and the third one fail
>> with an exception?
> 
> Actually, all three statements fail to produce the correct result.

So this is a bug in Python?

> frozenset remove the exception?
> 
> Because sets use a hash algorithm to find matches, whereas the last
> statement directly compares a unicode string with a byte string. Byte
> strings can only contain ASCII characters, that's why Python raises an
> exception. The problem is very easy to fix: use unicode strings for
> all non-ASCII strings.

No, byte strings contain characters which are at least 8 bits wide
<http://docs.python.org/ref/types.html>. But I don't understand what
Python is trying to decode, or why the exception mentions the ASCII
codec, because my file is encoded in UTF-8.

Dennis
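
The decode step that the traceback complains about can be reproduced
explicitly. The sketch below is written for Python 3, where nothing is
decoded implicitly, so each step is spelled out. It assumes that the
comparison in the original program decoded the byte string with the
interpreter's default encoding (ASCII in Python 2) and not with the
encoding declared for the source file. The helper decode_or_error and
the variable names are made up for illustration.

```python
# Python 3 sketch of the decode that Python 2 attempted implicitly.
# Assumption: Python 2 used its default encoding (ASCII) for this step.

text = "F\u00e4cher"        # the unicode string, i.e. u"Fächer"
raw = text.encode("utf-8")  # the bytes a UTF-8 source file holds for "Fächer"


def decode_or_error(data, codec):
    """Hypothetical helper: return the decoded text or the error message."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return "error: %s" % exc


# Decoding as ASCII fails on byte 0xc3, the first byte of the UTF-8 "ä".
ascii_result = decode_or_error(raw, "ascii")

# Decoding with the codec that matches the bytes succeeds.
utf8_result = decode_or_error(raw, "utf-8")

# Once both sides are text, set membership behaves as expected.
fields = frozenset([utf8_result])

print(ascii_result)
print(utf8_result == text)
print(text in fields)
print(raw in fields)  # undecoded bytes never match a text element
```

If that assumption holds, the set lookups print False without raising
because a differing hash means the two strings are never compared, and
only a direct comparison triggers the decode. Keeping every non-ASCII
string as a unicode string avoids the mismatch in both cases.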