[Python-Dev] Dicts are broken (Was: unicode hell/mixing str and unicode as dictionary keys)

Sat Aug 5 00:26:02 CEST 2006

M.-A. Lemburg wrote:
> Terry Reedy wrote:
>> "Michael Hudson" <mwh at python.net> wrote in message 
>> news:2m3bccwopj.fsf at starship.python.net...
>>> Michael Chermside <mcherm at mcherm.com> writes:
>>>
>>>> I'm changing the subject line because I want to convince everyone that
>>>> the problem being discussed in the "unicode hell" thread has nothing
>>>> to do with unicode and strings. It's all about dicts.
>>> I'd say it's more to do with __eq__.  It's a strange __eq__ method
>>> that raises an Exception, IMHO.
>> I agree; a == b should always work, certainly unless explicitly programmed 
>> otherwise in Python for a particular class. 
> 
> ... which this is.
> 
>> So I think the proper solution 
>> is fix the buggy __eq__ method to return False instead.  If a byte string 
>> does not have a default (ascii) text interpretation, then it obviously is 
>> not equal to any particular unicode text.
>>
>> The fundamental axiom of sets and hence of dict keys is that any 
>> object/value either is or is not a member (at any given time for 'mutable' 
>> set collections).  This requires that testing an object for possible 
>> membership by equality give a clean True or False answer.
>>
>>> Please do realize that the motivation for this change was hours and
>>> hours of tortuous debugging caused by a buggy __eq__ method making keys
>>> "impossibly" seem to not be in dictionaries.
>> So why not fix the buggy __eq__ method?
> 
> Because it's not buggy.
> 
> Python just doesn't know the encoding of the 8-bit string, so can't
> make any assumptions about it. As a result, it raises an exception to inform
> the programmer.
> 
> It is quite possible that the string uses an encoding where the
> Unicode string is indeed equal to the string, assuming this
> encoding, e.g.

Isn't this a case where it should be up to the programmer to make sure 
the comparison makes sense in the context in which it is used?  That is, 
if I'm comparing two different forms of data, isn't it up to me to convert 
them explicitly to the same form before comparing them?
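
To make that concrete, here is a minimal sketch of what I mean by 
converting explicitly first.  The helper name same_text and the sample 
data are made up for illustration; the only real call is the decode() 
method on the byte string, and the caller has to supply the encoding.

```python
def same_text(raw, text, encoding):
    """Compare a byte string with a text string by decoding explicitly.

    Hypothetical helper for illustration: the caller states which
    encoding the bytes are in, so nothing has to be guessed.
    """
    try:
        decoded = raw.decode(encoding)
    except UnicodeDecodeError:
        # The bytes are not valid in this encoding, so they cannot match.
        return False
    return decoded == text


raw = b"caf\xe9"    # "cafe" with an e-acute, encoded as Latin-1
text = u"caf\xe9"   # the same word as a unicode string

print(same_text(raw, text, "latin-1"))  # True
print(same_text(raw, text, "utf-8"))    # False: not valid UTF-8
```

The same bytes give a different answer depending on the encoding I name, 
so the decision belongs to the code that knows where the data came from.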

In the case of comparing an 8-bit string and a unicode string, I would 
think they should always compare unequal.  But changing that now would 
probably (?) break way too much.  (Though it might also uncover quite a 
few potential or even real bugs.) ;-)
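
As a rough sketch of what "always unequal" would mean for dictionary 
keys, assume an interpreter where byte strings and text are distinct 
types that never compare equal (Python 3 behaves this way; the 
interpreter under discussion here does not).  The variable names are 
only for illustration.

```python
# Assumes Python 3 semantics: bytes and str never compare equal.
d = {u"caf\xe9": "text key"}
raw = b"caf\xe9"   # Latin-1 bytes for the same word

# The lookup gives a clean answer instead of raising from __eq__.
found_raw = raw in d

# Decoding explicitly first is what makes the key match.
found_decoded = raw.decode("latin-1") in d

print(found_raw)      # False
print(found_decoded)  # True
```

Membership then always comes back as a plain True or False, and any 
match across the two types has to be asked for explicitly.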

Cheers,
    Ron