[ python-Bugs-1460886 ] Broken __hash__ for Unicode objects

SourceForge.net noreply at sourceforge.net
Thu Mar 30 00:36:50 CEST 2006


Bugs item #1460886, was opened at 2006-03-29 21:16
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1460886&group_id=5470

Category: Unicode
Group: None
Status: Closed
Resolution: Invalid
Priority: 5
Submitted By: Joe Wreschnig (piman)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Broken __hash__ for Unicode objects

Initial Comment:
http://docs.python.org/ref/customization.html says
equal objects should hash to the same value. But this
is not the case when the default Unicode encoding has
been changed (by e.g. importing PyGTK).

Using Python 2.4.2:

>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> hash(u"\u1234"), hash(str(u"\u1234"))
(-518661067, -1855038154)
>>> u"\u1234" == str(u"\u1234")
True
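
Why such a mismatch matters can be sketched in Python 3 syntax with a hypothetical class (LooseText and its tag attribute are illustrative and not part of Python): when two objects compare equal but hash differently, a dictionary keyed on one cannot find the other.

```python
class LooseText:
    """Hypothetical value type whose hash depends on something equality ignores."""

    def __init__(self, text, tag):
        self.text = text
        self.tag = tag  # stands in for "unicode vs. encoded str" representation

    def __eq__(self, other):
        # Equality looks only at the text.
        return isinstance(other, LooseText) and self.text == other.text

    def __hash__(self):
        # Deliberately broken: equal objects can return different hashes.
        return self.tag


a = LooseText("\u1234", 1)
b = LooseText("\u1234", 2)
table = {a: "value"}

print(a == b)              # True
print(hash(a) == hash(b))  # False
print(b in table)          # False: the lookup never reaches __eq__
```

The dictionary compares stored hashes before it calls __eq__, so the equal key is treated as missing.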

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2006-03-30 00:36

Message:

You can tweak the default encoding setting in site.py, but
again, you're on your own when doing so.

I also wonder why you would expect str(u"\u1234") (==
'\xe1\x88\xb4' with your setting) to have the same hash
value as a single Unicode code point.
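
The byte sequence quoted above can be checked in Python 3, where the encoding step is always explicit. This is only a sketch; the variable names are illustrative.

```python
text = "\u1234"                 # one Unicode code point
encoded = text.encode("utf-8")  # what str() produced under the utf-8 setting

print(len(text))     # 1
print(encoded)       # b'\xe1\x88\xb4'
print(len(encoded))  # 3
```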

Note that hash values are cached inside the Unicode object.
If we were to let the hash value depend on the current
setting of the default encoding, it would be possible to
have two equal Unicode objects with two different hash values.
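
A minimal Python 3 sketch of that caching problem, under stated assumptions: CachedText, the settings dictionary, and the byte-sum "hash" are all hypothetical stand-ins, not how the interpreter implements Unicode hashing.

```python
settings = {"encoding": "utf-8"}  # hypothetical stand-in for the default encoding


class CachedText:
    """Hypothetical text object that caches an encoding-dependent hash."""

    def __init__(self, text):
        self.text = text
        self._hash = None  # computed on first use, then never recomputed

    def __eq__(self, other):
        return isinstance(other, CachedText) and self.text == other.text

    def __hash__(self):
        if self._hash is None:
            # Toy hash: sum of the bytes under whatever encoding is current.
            self._hash = sum(self.text.encode(settings["encoding"]))
        return self._hash


first = CachedText("\u1234")
hash(first)                         # cached while the setting is utf-8

settings["encoding"] = "utf-16-le"  # the setting changes afterwards
second = CachedText("\u1234")

print(first == second)              # True
print(hash(first) == hash(second))  # False
```

The first object keeps the hash it cached under the old setting, so two equal objects end up with different hash values.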

It's one of the few places that actually does hard-code the
ASCII default encoding. Most others will work with other
encodings as well, but again: no guarantees. Note that
because of this, you don't get a different hash value for
ASCII Unicode code points and ASCII-compatible default
encodings.
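
One way to see why ASCII text is unaffected, sketched in Python 3: ASCII-compatible encodings all produce the same bytes for ASCII code points, while a non-ASCII code point encodes differently depending on the encoding. The encoding names chosen here are only examples.

```python
ascii_text = "hello"
# Every ASCII-compatible encoding yields identical bytes for ASCII text.
ascii_bytes = {ascii_text.encode(name) for name in ["ascii", "utf-8", "latin-1"]}
print(len(ascii_bytes))  # 1

wide_text = "\u1234"
# A non-ASCII code point gives different bytes under different encodings.
wide_bytes = {wide_text.encode(name) for name in ["utf-8", "utf-16-le"]}
print(len(wide_bytes))   # 2
```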

In future versions, the default encoding will go away, so
please don't start relying on it.


----------------------------------------------------------------------

Comment By: Joe Wreschnig (piman)
Date: 2006-03-30 00:03

Message:

What's the point of having a default encoding if it breaks a
fundamental part of the language on anything but the default
value?

I mean, I can tweak site.py to set it to utf-8; does this
become a valid bug then? site.py even contains a check to
set the encoding to an alternate value if I want.

This may be a "known fact". Every reported bug is a known
fact. That doesn't mean it's not a bug.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-03-29 21:54

Message:

This is a known fact.

When changing the default encoding, you are basically on
your own, so there's nothing much we can do about it.

BTW, the above hack that you're using to get at the
sys.setdefaultencoding() API already indicates that you're
leaving the path of standard Python. 

We deliberately remove that API from the sys module in
site.py to make changing the default encoding an explicit task.
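
A quick way to confirm that the function is absent after a normal startup, as a sketch: on Python 2 this assumes site.py has run, and on Python 3 the function does not exist at all.

```python
import sys

# The default encoding currently in effect.
current = sys.getdefaultencoding()

# After normal startup the setter is not reachable from the sys module.
has_setter = hasattr(sys, "setdefaultencoding")

print(current)
print(has_setter)
```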

If importing PyGTK has the side-effect of applying such a
hack, then PyGTK is seriously broken and you should report
this to their developers.

Closing as "Invalid".


----------------------------------------------------------------------
