Problem with sets and Unicode strings

Tue Jun 27 18:05:51 EDT 2006

Dennis Benzinger wrote:
> Serge Orlov wrote:
>> On 6/27/06, Dennis Benzinger <Dennis.Benzinger at gmx.net> wrote:
>>> Hi!
>>>
>>> The following program in an UTF-8 encoded file:
>>>
>>>
>>> # -*- coding: UTF-8 -*-
>>>
>>> FIELDS = ("Fächer", )
>>> FROZEN_FIELDS = frozenset(FIELDS)
>>> FIELDS_SET = set(FIELDS)
>>>
>>> print u"Fächer" in FROZEN_FIELDS
>>> print u"Fächer" in FIELDS_SET
>>> print u"Fächer" in FIELDS
>>>
>>>
>>> gives this output
>>>
>>>
>>> False
>>> False
>>> Traceback (most recent call last):
>>>    File "test.py", line 9, in ?
>>>      print u"Fächer" in FIELDS
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
>>> ordinal not in range(128)
>>>
>>>
>>> Why do the first two print statements succeed and the third one fails
>>> with an exception?
>> Actually, all three statements fail to produce the correct result.
> 
> So this is a bug in Python?

No.

>> frozenset remove the exception?
>>
>> Because sets use a hash algorithm to find matches, whereas the last
>> statement directly compares a Unicode string with a byte string. Byte
>> strings can only contain ASCII characters, that's why Python raises an
>> exception. The problem is very easy to fix: use Unicode strings for
>> all non-ASCII strings.
> 
> No, byte strings contain characters which are at least 8-bit wide 
> <http://docs.python.org/ref/types.html>. But I don't understand what 
> Python is trying to decode and why the exception says something about 
> the ASCII codec, because my file is encoded with UTF-8.

Please read

   http://www.amk.ca/python/howto/unicode

The string in all of the containers (FIELDS, FROZEN_FIELDS, FIELDS_SET) is a
regular byte string, not a Unicode string. The encoding declaration only
controls how the file is parsed. The string literal that you use for FIELDS is a
regular string literal, not a Unicode string literal, so the object it creates
is an 8-bit byte string holding the UTF-8 bytes from your file. The tuple
containment test compares your Unicode string object to the regular string
object for equality. Python does this comparison by first decoding the regular
string into a Unicode string. Regular strings carry no encoding information
(the encoding declaration in your file only controls parsing, nothing else), so
Python assumes ASCII and raises UnicodeDecodeError when it meets a byte outside
that range, here 0xc3.
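
A minimal sketch of that decoding step, with the decode() calls written out
explicitly so the conversion that Python attempts implicitly is visible. The
helper name decodes_as is hypothetical, and the byte values assume the literal
was saved as UTF-8, as in the original file.

```python
# -*- coding: utf-8 -*-

# The bytes that the plain literal "Fächer" produces in a UTF-8 source file.
raw = b"F\xc3\xa4cher"
# The Unicode string u"Fächer", written with an escape to avoid encoding issues.
text = u"F\xe4cher"


def decodes_as(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# ASCII cannot represent byte 0xc3, which is what the traceback reports.
ascii_result = decodes_as(raw, "ascii")
# Decoding with the encoding the bytes were actually written in succeeds.
utf8_result = decodes_as(raw, "utf-8")

print(ascii_result is None)   # True
print(utf8_result == text)    # True

# With Unicode strings in the containers, all three membership tests agree.
FIELDS = (raw.decode("utf-8"),)
print(text in FIELDS)             # True
print(text in set(FIELDS))        # True
print(text in frozenset(FIELDS))  # True
```

In the original program the same effect comes from writing the literal in
FIELDS as u"Fächer", so that no byte string is ever compared with a Unicode
string.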

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco