finding items that occur more than once in a list

Ninereeds stephenhorne100 at aol.com
Tue Mar 18 08:25:15 EDT 2008


Just to throw in one more alternative: if you sort your list, you only
need to test adjacent items for equality rather than searching the rest
of the list for each item. You should get O(n log n) rather than O(n^2),
since the bottleneck is now the sorting rather than the searching for
duplicates. Also, since the sort method is built into the list type,
sorting performance should be very good.
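
For illustration, here is a minimal sketch of the sort-then-compare-adjacent
idea (the function name and sample data are mine, not from the original
discussion):

    def duplicates_by_sorting(items):
        """Return the values that occur more than once, via sort + adjacent compare."""
        ordered = sorted(items)           # O(n log n) dominates the total cost
        dups = []
        for prev, curr in zip(ordered, ordered[1:]):
            # equal neighbours mean curr repeats prev; record each value once
            if prev == curr and (not dups or dups[-1] != curr):
                dups.append(curr)
        return dups

    print(duplicates_by_sorting([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]))   # [1, 3, 5]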

The dictionary version Chris suggests (and the essentially equivalent
set-based approach) does much the same thing, but uses hashing rather
than ordering to organise the list and spot duplicates. This is *not*
guaranteed to be O(n), because the rate of collisions increases as the
hash table fills. If collisions are handled by building a linked list
for each hash table entry, the worst case is still O(n^2), since (for
large amounts of data) each collision can still mean a linear search.
If collisions are handled by building binary trees, the worst case is
back to O(n log n). That is, for large enough datasets, the worst-case
performance of a hash table is essentially the performance of its
collision-handling data structure (ignoring constant factors).
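
As a rough sketch of the hashing approach (again, names and data are
purely illustrative), a set of already-seen values replaces the search:

    def duplicates_by_hashing(items):
        """Return values seen more than once, relying on hashing rather than ordering."""
        seen = set()
        dups = set()
        for item in items:
            if item in seen:              # average O(1) lookup, worse with many collisions
                dups.add(item)
            else:
                seen.add(item)
        return dups

    print(sorted(duplicates_by_hashing([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])))  # [1, 3, 5]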

CPython, at least, handles collisions with open addressing (probing
other slots in the same table) rather than with linked lists or
per-entry trees; trees would in any case require that all dictionary
keys support ordered comparisons. Either way, resolving collisions can
degenerate into a linear scan, so using a dictionary or set to
eliminate duplicates is O(n^2) in the worst case.

That said, hash tables have very good constant factors, so the
dictionary technique will be a very good performer in general, and for
typical data the performance will often look like O(n) in practice.

The sort-then-filter approach should still be competitive, but of
course it requires that the contents of the list can be ordered
consistently. An inappropriate hash function can similarly cause
problems for the dictionary approach, making that theoretical O(n^2)
worst case show up very quickly. This is probably one reason why the
cookbook recipe recommends an O(n^2) searching-based approach.
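
For completeness, the sort-then-filter idea can also be written with
itertools.groupby, which makes the "adjacent equal items" step explicit
(assuming the items can be ordered consistently; names are illustrative):

    from itertools import groupby

    def duplicates_by_groupby(items):
        """Group a sorted copy of the list and keep values whose group has length > 1."""
        return [value for value, group in groupby(sorted(items))
                if len(list(group)) > 1]

    print(duplicates_by_groupby([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]))   # [1, 3, 5]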


