[Python-Dev] Hashes in Python3.5 for tuples and frozensets

Thu May 17 04:16:16 EDT 2018

On Thu, May 17, 2018 at 5:21 PM, Anthony Flury via Python-Dev
<python-dev at python.org> wrote:
> Victor,
> Thanks for the link, but to be honest it will just confuse people - neither
> the link nor the related bpo entries state that the fix is limited to
> strings. They simply talk about hash randomization - which in my opinion
> implies ALL hash algorithms; which is why I asked the question.
>
> I am not sure how much should be exposed about the scope of security fixes
> but you can understand my (and others') confusion.
>
> I am aware that applications shouldn't make assumptions about the value of
> any given hash value - apart from some simple assumptions based on hash
> value equality (i.e. if two objects have different hash values they can't
> be the same value).

The hash values of Python objects are calculated by the __hash__
method, so arbitrary objects can do what they like, including
degenerate algorithms such as:

class X:
    def __hash__(self): return 7

So it's impossible to randomize ALL hashes at the language level. Only
str and bytes hashes are randomized, because they're the ones most
likely to be exploitable - for instance, a web server will receive a
query like "http://spam.example/target?a=1&b=2&c=3" and provide a
dictionary {"a":1, "b":2, "c":3}. Similarly, a JSON decoder is always
going to create string keys in its dictionaries (JSON objects). Do you
know of any situation in which an attacker can provide the keys for a
dict/set as integers?
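
To see which hashes actually depend on the seed, here is a minimal sketch. It
assumes CPython 3.7 or later and uses the PYTHONHASHSEED environment variable
to start fresh interpreters with different seeds; the helper name
hash_in_fresh_process is made up for this illustration.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(expr, seed):
    """Return hash(<expr>) as computed by a new interpreter run with the given seed.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({expr}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # A tuple's hash is built from its elements' hashes, so a tuple of
    # strings is affected by the seed while a tuple of ints is not.
    for expr in ('"spam"', 'b"spam"', "12345", "(1, 2, 3)", '("a", "b")'):
        first = hash_in_fresh_process(expr, seed=1)
        second = hash_in_fresh_process(expr, seed=2)
        verdict = "differs" if first != second else "same"
        print(f"{expr:12} {verdict} across seeds")
```

On CPython the str and bytes lines should report a difference and the integer
ones should not, but none of this is something to rely on.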

> BTW:
>
> This question was prompted by a question on a social media platform about
> whether hash values are transferable across platforms.
> Everything I could find stated that after Python 3.3 ALL hash values were
> randomized - but that clearly isn't the case; and the original questioner
> identified that some hash values are randomized and others aren't.

That's actually immaterial. Even if the hashes weren't actually
randomized, you shouldn't be making assumptions about anything
specific in the hash, save that *within one Python process*, two equal
values will have equal hashes (and therefore two objects with unequal
hashes will not be equal).
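
A short sketch of that one guarantee, within a single process. The Point class
is a made-up example, not something from the thread.

```python
from decimal import Decimal
from fractions import Fraction

# Numerically equal values of different types compare equal, so they must
# (and do) share a hash within one process.
values = [1, 1.0, True, Fraction(1, 1), Decimal(1)]
assert all(v == 1 for v in values)
assert len({hash(v) for v in values}) == 1


class Point:
    """Hypothetical value type that keeps __eq__ and __hash__ consistent."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Derive the hash from exactly the fields that __eq__ compares.
        return hash((self.x, self.y))


# Equal objects hash equally, so a set sees them as one element.
assert Point(1, 2) == Point(1, 2)
assert hash(Point(1, 2)) == hash(Point(1, 2))
assert len({Point(1, 2), Point(1, 2)}) == 1
```

The converse does not hold: equal hashes say nothing about equality, which is
why a stored hash alone cannot identify a value.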

> I did suggest strongly to the original questioner that relying on the same
> hash value across different platforms wasn't a clever solution - their
> original plan was to store hash values in a cross system database to enable
> quick retrieval of data (!!!). I did remind the OP that a hash value wasn't
> guaranteed to be unique anyway - and they might come across two different
> values with the same hash - and no way to distinguish between them if all
> they have is the hash. Hopefully their revised design will store the key,
> not the hash.

Uhh.... if you're using a database, let the database do the work of
being a database. I don't know what this "cross system database" would
be implemented in, but if it's a proper multi-user relational database
engine like PostgreSQL, it's already going to have way better indexing
than anything you'd do manually. I think there are WAY better
solutions than worrying about Python's inbuilt hashing.

If you MUST hash your data for sharing and storage, the easiest
solution is to just use a cryptographic hash straight out of
the hashlib module.
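
As a rough sketch of what that could look like: the function name
stable_digest and the choice of JSON as the serialization are assumptions made
for this example, and it only handles JSON-serializable data.

```python
import hashlib
import json


def stable_digest(value):
    """Return a SHA-256 hex digest that is the same on every platform and run.

    Hypothetical helper: the value is first serialized to a canonical JSON
    byte string (sorted keys, fixed separators, UTF-8) so that equal data
    always produces identical bytes.
    """
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Key order does not matter, unlike a naive str() or repr() of the dict.
    print(stable_digest({"a": 1, "b": 2, "c": 3}))
    print(stable_digest({"c": 3, "b": 2, "a": 1}))
```

Unlike hash(), the digest does not depend on the interpreter, the platform or
PYTHONHASHSEED. It is still not a substitute for storing the key itself.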

ChrisA