Problem with sets and Unicode strings

Wed Jun 28 12:40:26 EDT 2006

Serge Orlov wrote:
> On 6/27/06, Dennis Benzinger <Dennis.Benzinger at gmx.net> wrote:
>> Serge Orlov wrote:
>> > On 6/27/06, Dennis Benzinger <Dennis.Benzinger at gmx.net> wrote:
>> >> Hi!
>> >>
>> >> The following program in a UTF-8 encoded file:
>> >>
>> >>
>> >> # -*- coding: UTF-8 -*-
>> >>
>> >> FIELDS = ("Fächer", )
>> >> FROZEN_FIELDS = frozenset(FIELDS)
>> >> FIELDS_SET = set(FIELDS)
>> >>
>> >> print u"Fächer" in FROZEN_FIELDS
>> >> print u"Fächer" in FIELDS_SET
>> >> print u"Fächer" in FIELDS
>> >>
>> >>
>> >> gives this output
>> >>
>> >>
>> >> False
>> >> False
>> >> Traceback (most recent call last):
>> >>    File "test.py", line 9, in ?
>> >>      print u"Fächer" in FIELDS
>> >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
>> >> ordinal not in range(128)
>> >>
>> >>
>> >> Why do the first two print statements succeed and the third one fails
>> >> with an exception?
>> >
>> > Actually all three statements fail to produce the correct result.
>>
>> So this is a bug in Python?
> 
> No.
> 
>> > frozenset remove the exception?
>> >
>> > Because sets use a hash algorithm to find matches, whereas the last
>> > statement directly compares a unicode string with a byte string. Byte
>> > strings can only contain ascii characters, that's why python raises an
>> > exception. The problem is very easy to fix: use unicode strings for
>> > all non-ascii strings.
>>
>> No, byte strings contain characters which are at least 8-bit wide
>> <http://docs.python.org/ref/types.html>.
> 
> Yes, but later it's written that non-ascii characters do not have
> universal meaning assigned to them. In other words if you put byte
> 0xE4 into a byte string all python knows about it is that it's *some*
> character. If you put character U+00E4 into a unicode string python
> knows it's a "latin small letter a with diaeresis". Trying to compare
> *some* character with a specific character is obviously undefined.
> [...]
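
To make the "*some* character" point concrete, here is a minimal sketch. It is written for a Python 3 interpreter (an assumption, since the program above uses Python 2 syntax), and the two codecs were picked purely for illustration. It only shows that the byte 0xE4 names a different character depending on which codec is assumed, and that "Fächer" saved as UTF-8 does not contain that byte at all.

```python
import unicodedata

raw = b"\xe4"  # a single byte; on its own it names no particular character

# The same byte decodes to different characters under different codecs.
as_latin1 = raw.decode("latin-1")
as_cyrillic = raw.decode("iso8859_5")

print(unicodedata.name(as_latin1))    # LATIN SMALL LETTER A WITH DIAERESIS
print(unicodedata.name(as_cyrillic))  # CYRILLIC SMALL LETTER EF

# In a UTF-8 encoded file the letter U+00E4 is stored as two bytes, 0xC3 0xA4,
# which is where the 0xc3 in the traceback comes from.
utf8_bytes = "F\u00e4cher".encode("utf-8")
print(utf8_bytes)                     # b'F\xc3\xa4cher'
```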

But <http://docs.python.org/ref/comparisons.html> says:

Strings are compared lexicographically using the numeric equivalents 
(the result of the built-in function ord()) of their characters. Unicode 
and 8-bit strings are fully interoperable in this behavior.

Doesn't this mean that Unicode and 8-bit strings can be compared and 
this comparison is well defined? (even if it is not meaningful)
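
For reference, a rough sketch of the two behaviours in question, written for a Python 3 interpreter, where bytes and text are never converted implicitly. The ASCII decode that the traceback points to is therefore spelled out by hand. The helper compare_after_ascii_decode and the class Probe are hypothetical names made up for this illustration, not anything from the interpreter. The first part shows the decode step failing on byte 0xc3. The second part shows that a set lookup only calls the equality test when the hashes match, while a tuple lookup always calls it.

```python
def compare_after_ascii_decode(byte_string, text):
    """Hypothetical helper: decode with the ASCII codec, then compare."""
    return byte_string.decode("ascii") == text


class Probe:
    """Hypothetical stand-in for a string with a fixed hash; counts __eq__ calls."""

    eq_calls = 0

    def __init__(self, label, hash_value):
        self.label = label
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        Probe.eq_calls += 1
        return self.label == other.label


text = "F\u00e4cher"
stored = text.encode("utf-8")  # the bytes a UTF-8 source file holds for the literal

# Pure ASCII bytes survive the decode, so the comparison goes through.
ascii_result = compare_after_ascii_decode(b"Fach", "Fach")
print(ascii_result)

# The UTF-8 bytes do not: the decode fails before any comparison happens.
try:
    compare_after_ascii_decode(stored, text)
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = str(exc)
print(decode_error)

# Two objects with different hashes, standing in for the byte and unicode strings.
stored_probe = Probe("bytes", 1)
wanted_probe = Probe("text", 2)

in_set = wanted_probe in {stored_probe}     # hash mismatch, __eq__ is skipped
calls_after_set = Probe.eq_calls
in_tuple = wanted_probe in (stored_probe,)  # no hashing, __eq__ is called
calls_after_tuple = Probe.eq_calls

print(in_set, calls_after_set)
print(in_tuple, calls_after_tuple)
```

If the program above behaves the same way, the two set lookups print False without ever comparing the strings, and only the tuple lookup reaches the comparison and its decode step.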

Thanks for your answers,
Dennis