[issue43475] Worst-case behaviour of hash collision with float NaN

Thu Mar 11 09:55:33 EST 2021

New submission from Cong Ma <m.cong at protonmail.ch>:

Summary: CPython hash all NaN values to 0. This guarantees worst-case behaviour for dict if numerous existing keys are NaN. I think by hashing NaN using the generic object (or "pointer") hash instead, the worst-case situation can be alleviated without changing the semantics of either dict or float. However, this also requires changes to how complex and Decimal objects hash, and moreover incompatible change to sys.hash_info. I would like to hear how Python developers think about this matter.

--------

Currently CPython uses the hard-coded macro constant 0 (_PyHASH_NAN, defined in Include/pyhash.h) for the hash value of all floating point NaNs. The value is part of the sys.hashinfo API and is re-used by complex and Decimal in computing its hash in accordance with Python builtin-type documentation. [0]

(The doc [0] specifically says that "[a]ll hashable nans have the same hash value.")

This is normally not a great concern, except for the worst case performance. The problem is that, since they hash to the same value and they're guaranteed to compare unequal to any compatible numeric value -- not even to themselves, this means they're guaranteed to collide.

For this reason I'd like to question whether it is a good idea to have all hashable NaNs with the same hash value.

There has been some discussions about this over the Web for some time (see [1]). In [1] the demo Python script times the insertion of k distinct NaN keys (as different objects) into a freshly created dict. Since the keys are distinct and are guaranteed to collide with each other (if any), the run time of a single lookup/insertion is roughly linear to the existing number of NaN keys. I've recreated the same script using with a more modern Python (attached).

I'd suggest a fix for this worst-case behaviour: instead of returning the hash value 0 for all NaNs, use the generic object (pointer) hash for these objects. As a PoC (also in the attached script), it roughly means

```
class myfloat(float):
    def __hash__(self):
        if self != self:  # i.e., math.isnan(self)
            return object.__hash__(self)
        return super().__hash__(self)
```

This will
- keep the current behaviour of dict intact;
- keep the invariant `a == b implies hash(a) == hash(b)` intact, where applicable;
- uphold all the other rules for Python numeric objects listed in [0];
- make hash collisions no more likely than with object() instances (dict lookup time is amortized constant w.r.t. existing number of NaN keys).

However, it will
- not keep the current rule "All hashable nans have the same hash value";
- not be compatible with the current sys.hash_info API (requires the removal of the "nan" attribute from there and documenting the change);
- require congruent modifications in complex and Decimal too.

Additionally, I don't think this will affect module-level NaN "constants" such as math.nan and how they behave. The "NaN constant" has never been a problem to begin with. It's only the *distinct* NaN objects that may cause the worst-case behaviour.

--------

Just for the record I'd also like to clear some outdated info or misconception about NaN keys in Python dicts. It's not true that NaN keys, once inserted, cannot be retrieved (e.g., as claimed in [1][2]). In Python, they can be, if you keep the original key *object* around by keeping a reference to it (or obtaining a new one from the dict by iterating over it). This, I think, is because Python dict compares for object identity before rich-comparing for equality in `lookdict()` in Objects/dictobject.c, so this works for `d = dict()`:

```
f = float("nan")
d[f] = "value"
v = d[f]
```

but this fails with `KeyError`, as it should:

```
d[float("nan")] = "value"
v = d[float("nan")]
```

In this regard the NaN float object behaves exactly like the object() instance as keys -- except for the collisions. That's why I think at least for floats the "object" hash is likely to work. The solution using PRNG [1] (implemented with the Go language) is not necessary for CPython because the live objects are already distinct.

--------

Links:

[0] https://docs.python.org/3/library/stdtypes.html#hashing-of-numeric-types
[1] https://research.swtch.com/randhash
[2] https://readafterwrite.wordpress.com/2017/03/23/how-to-hash-floating-point-numbers/

----------
components: Interpreter Core, Library (Lib)
files: nan_key.py
messages: 388508
nosy: congma
priority: normal
severity: normal
status: open
title: Worst-case behaviour of hash collision with float NaN
type: performance
Added file: https://bugs.python.org/file49869/nan_key.py

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue43475>
_______________________________________