Cache memory and its effect on list searching

Chris Angelico rosuav at gmail.com
Sat Dec 17 01:34:03 EST 2016


On Sat, Dec 17, 2016 at 5:12 PM, Steve D'Aprano
<steve+python at pearwood.info> wrote:
> On Sat, 17 Dec 2016 03:55 pm, Chris Angelico wrote:
>
>> More than that: the lists are searched in linear time, the binary
>> search runs in logarithmic time, but the set lookup is constant time.
>> Doesn't matter how big your set is, you can test for membership with
>> one hash lookup.
>
> To be pedantic: it is constant time on average.

True. And technically, you could say that a linear or binary search
has a best-case of constant time (if you find it right at the
beginning or middle of the list). I was talking about the average.
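
For illustration, here's a rough sketch of the three approaches - the
names and sizes are just made up for the example:

import bisect

haystack = list(range(10**6))   # sorted list
members = set(haystack)         # same elements as a set
needle = 987654

# Linear search: O(n) on average; O(1) only if it happens to be near
# the front.
found_linear = needle in haystack

# Binary search on the sorted list: O(log n).
i = bisect.bisect_left(haystack, needle)
found_binary = i < len(haystack) and haystack[i] == needle

# Set membership: O(1) on average - one hash, usually one equality test.
found_set = needle in members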

> But outside of contrived or hostile examples, where elements of the set are
> designed to collide, typically you would expect hash collisions to be rare,
> and long chains of colliding elements even rarer.  For random elements,
> most will require only a single hash and a single equality test, a small
> number might require two tests, three would be even rarer, and so forth.

Yep. In the normal case it's going to be one or two tests - and that
doesn't depend on the size of the set. The degenerate case does, but
the normal case doesn't.
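
You can convince yourself of that with a quick (and unscientific)
timing sketch - the names here are mine, nothing standard:

from timeit import timeit

small = set(range(10**3))
large = set(range(10**7))

# Both lookups hash the key once and probe; the size of the set
# shouldn't change the time in any meaningful way.
print(timeit('999 in small', globals=globals(), number=10**6))
print(timeit('999 in large', globals=globals(), number=10**6))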

> So strictly speaking, and I realise I'm being exceedingly pedantic here, a
> *sufficiently large* dict or set MUST have colliding elements. How large
> is "sufficiently large"? Its at least 2**31, more than two billion, so
> while you are right that *in practice* set/dict lookups require only a
> single hash + equality, in principle (and sometimes in practice too)
> collisions can be significant.

That assumes the hash has a finite size. If you have enough memory to
store that many unique objects, you can probably afford to use a
hashing algorithm with enough room. Even so, the chance of avoiding
collisions depends on the capacity, and the chance of having no
collisions at all is pretty low... as a rule of thumb, I estimate a
50-50 chance of a collision when capacity is the square of usage. So
if you have a thousand entries but capacity for a million, you have
about a 50% chance of at least one collision. (The numbers aren't
quite right, but it's a good rule of thumb.)
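
That rule of thumb is essentially the birthday problem. Here's a quick
way to sanity-check it - my own sketch, assuming hashes land uniformly
across the available slots:

from math import exp

def collision_probability(entries, capacity):
    # Standard birthday-problem approximation: the chance of at least
    # one collision when `entries` keys hash uniformly into `capacity`
    # slots.
    return 1 - exp(-entries * (entries - 1) / (2 * capacity))

print(collision_probability(1000, 10**6))   # about 0.39

A thousand entries in a million slots comes out around 39%, so calling
it 50-50 is in the right ballpark.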

> Nevertheless, I mention this only for completeness. In practice, you almost
> never need to care about hash collisions except for the most pathological
> cases.

Indeed. That's why hash tables are so prevalent, despite their
appalling worst-case.

ChrisA


