finding items that occur more than once in a list

Hrvoje Niksic hniksic at xemacs.org
Tue Mar 18 11:18:04 EDT 2008


Ninereeds <stephenhorne100 at aol.com> writes:

> The dictionary version Chris suggests (and the essentially
> equivalent set-based approach) does much the same thing, but uses
> hashing rather than ordering to organise the list and spot
> duplicates. This is *not* O(n), because the rate of collisions
> increases as the hash table fills. If collisions are handled by
> building a linked list for each hash table entry, the overall
> performance is still O(n^2).

This doesn't apply to Python, which implements dict storage as an
open-addressed table and automatically (and exponentially) grows the
table when the number of entries approaches 2/3 of the table size.
Assuming a good hash function, filling the dict should yield amortized
constant time for individual additions.
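
For concreteness, a dict-based scan might look something like the
sketch below (find_duplicates is just an illustrative name, not code
from elsewhere in the thread).  Each element is looked up and stored
once, and with a well-behaved hash function each of those operations
is expected constant time, so the whole pass is expected O(n):

    def find_duplicates(items):
        """Return the items that occur more than once in `items`."""
        counts = {}
        for item in items:
            # Each lookup/store is expected O(1) in a Python dict.
            counts[item] = counts.get(item, 0) + 1
        return [item for item, n in counts.items() if n > 1]

    >>> sorted(find_duplicates([1, 2, 3, 2, 1, 1]))
    [1, 2]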

> I suspect Python handles collisions using linked lists,

Why suspect?  It's trivial to check.  dictobject.c says:

    The basic lookup function used by all operations.
    This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
    Open addressing is preferred over chaining since the link overhead for
    chaining would be substantial (100% with typical malloc overhead).
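
To give a feel for what that means, here is a deliberately toy
open-addressed table in Python.  It is not how dictobject.c does it
(CPython uses a perturbed probe sequence and supports deletion, among
other things); it is just a minimal sketch of the two ideas mentioned
above: probing to another slot on collision instead of chaining, and
growing the table exponentially before it gets 2/3 full:

    class ToyDict:
        """Toy open-addressed hash table: linear probing, no deletion."""

        def __init__(self):
            self._size = 8
            self._used = 0
            self._slots = [None] * self._size   # each slot: None or (key, value)

        def _probe(self, key):
            i = hash(key) % self._size
            # On a collision, walk to the next slot rather than chaining.
            while self._slots[i] is not None and self._slots[i][0] != key:
                i = (i + 1) % self._size
            return i

        def __setitem__(self, key, value):
            i = self._probe(key)
            if self._slots[i] is None:
                self._used += 1
            self._slots[i] = (key, value)
            if self._used * 3 >= self._size * 2:   # keep load factor below 2/3
                self._resize()

        def __getitem__(self, key):
            slot = self._slots[self._probe(key)]
            if slot is None:
                raise KeyError(key)
            return slot[1]

        def _resize(self):
            entries = [s for s in self._slots if s is not None]
            self._size *= 4                        # grow exponentially
            self._used = 0
            self._slots = [None] * self._size
            for key, value in entries:
                self[key] = value

Because the table grows exponentially, the cost of the occasional
O(n) resize is spread over the insertions that preceded it, which is
where the amortized constant time per addition comes from.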


