[Python-Dev] unicode hell/mixing str and unicode as dictionary keys

Thu Aug 3 19:39:15 CEST 2006

Ralf Schmitt wrote:
>>>> Still trying to port our software. Here's another thing I noticed:
>>>>
>>>> d = {}
>>>> d[u'm\xe1s'] = 1
>>>> d['m\xe1s'] = 1
>>>> print d
>>>>
>>>> With python 2.5 I get:
>>>>
>>>> $ python2.5 t2.py
>>>> Traceback (most recent call last):
>>>>    File "t2.py", line 3, in <module>
>>>>      d['m\xe1s'] = 1
>>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: 
>>>> ordinal not in range(128)
>>>>
>> This is because Unicode and 8-bit string keys work
>> in the same way if and only if they are plain ASCII.
> 
> This is okay. But in the case where one is not ASCII I would prefer to 
> be able to compare them (not equal) instead of getting a UnicodeError.
> I know it's too late to change this, ...

It is too late to change this, since it was always like this ;-)

Seriously, Unicode is doing the right thing here: you should
really always get an exception if you compare apples and
oranges, rather than reverting to comparing the ids of apples
and oranges as a fall-back solution.

I believe that Py3k will implement this.

>> The reason lies in the hash function used by Unicode: it is
>> crafted to make hash(u) == hash(s) for all ASCII s, such
>> that s == u.
>>
>> For non-ASCII strings, there are no guarantees as to the
>> hash value of the strings or whether they match or not.
>>
>> This has been like that since Unicode was introduced, so it's
>> not new in Python 2.5.
>>
> 
> ...but in the case of dictionaries this behaviour has changed and in 
> prior versions of python dictionaries did work as I expected them to.
> Now they don't.

Let's put it this way: Python 2.5 uncovered a bug in your
application that has always been there. It's better to
fix your application than to argue for covering up the bug again.
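
One possible way to fix the application is to decode every byte-string
key to Unicode at the boundary, so the dictionary only ever sees one
string type. This is only a sketch: the helper name to_text is
hypothetical, and the latin-1 default is an assumption about how the
8-bit strings were encoded, which the application has to know for itself.

```python
# Sketch only: `to_text` is a hypothetical helper, and latin-1 is an
# assumed encoding for the 8-bit keys. Written so that it also runs on
# Python 3, where byte strings are `bytes` and Unicode strings are `str`.

def to_text(key, encoding="latin-1"):
    """Return key as a Unicode string, decoding byte strings explicitly."""
    if isinstance(key, bytes):
        return key.decode(encoding)
    return key

d = {}
d[to_text(u"m\xe1s")] = 1
d[to_text(b"m\xe1s")] = 2   # same key after decoding, so this overwrites
print(d)
```

With the explicit decode, both assignments refer to the same Unicode key,
and no implicit ASCII decoding is attempted during the dictionary lookup.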

> When working with unicode strings and (accidentally) mixing with str 
> strings, things might seem to work until the first non-ascii string
> is given to some code and one gets that UnicodeDecodeError (e.g. when 
> comparing them).
> 
> If one mixes unicode strings and str strings as keys in a dictionary 
> things might seem to work far longer until one tries to put in some
> non-ASCII string with the "wrong" hash value and suddenly things go boom.
> I'd rather keep the pre 2.5 behaviour.

It's actually a good preparation for Py3k where 1 == u'abc' will
(likely) also raise an exception.
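
The delayed failure described above can be reproduced without any
Unicode at all. The sketch below uses a hypothetical Fruit class whose
hash value is chosen by hand and whose comparison raises for mismatched
kinds; it assumes a dictionary that propagates exceptions raised while
comparing keys, which is the 2.5 behaviour discussed in this thread.

```python
# Sketch only: `Fruit` is a hypothetical class used to stand in for
# str/unicode keys. The dictionary compares keys with == only when their
# hash values are equal, so the exception shows up late.

class Fruit(object):
    def __init__(self, kind, hash_value):
        self.kind = kind
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        # Comparing apples and oranges is an error, not "unequal".
        if self.kind != other.kind:
            raise TypeError("cannot compare %s with %s" % (self.kind, other.kind))
        return self.hash_value == other.hash_value

d = {Fruit("apple", 1): "a"}
d[Fruit("orange", 2)] = "b"      # different hash: __eq__ is never called

try:
    d[Fruit("orange", 1)] = "c"  # same hash as the apple: __eq__ runs and raises
    collided = False
except TypeError:
    collided = True

print(len(d), collided)
```

Mixed keys coexist quietly as long as their hash values differ; the error
only surfaces once two keys of different kinds share a hash value.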

-- 
Marc-Andre Lemburg
eGenix.com
