min max from tuples in list
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Thu Dec 12 21:36:56 EST 2013
On Thu, 12 Dec 2013 13:54:10 +0100, Peter Otten wrote:
> Steven D'Aprano wrote:
>
>> In any case, sorting in Python is amazingly fast. You may be pleasantly
>> surprised that a version that sorts your data, while nominally O(N log
>> N), may be much faster than an O(N) solution that doesn't require
>> sorted data. If I were a betting man, I'd be willing to wager a shiny
>> new dollar[1] that sorting works out faster for reasonable sized sets
>> of data.
>
> Well, that was my first reaction, too. But then
[snip timeit results]
On my system, I don't get quite the same results as you. I find that the
groupby solution is slightly faster than the dict solution, although both
are slightly faster than the version I came up with. All three are plenty
fast enough though. Using the same function definitions as you, I get:
py> from timeit import Timer
py> setup = "from __main__ import time_dict, time_groupby, time_daprano"
py> t1 = Timer("_ = time_dict()", setup)
py> t2 = Timer("_ = time_groupby()", setup)
py> t3 = Timer("_ = time_daprano()", setup)
and results:
py> min(t1.repeat(number=100000, repeat=6))
6.7522243447601795
py> min(t2.repeat(number=100000, repeat=6))
6.7246054373681545
py> min(t3.repeat(number=100000, repeat=6))
7.116707382723689
But, there's a flaw in the above analysis. The list that you give is pre-
sorted, and Python's sort routine is *very* efficient at sorting already
sorted lists. So I randomly shuffled it and redid the tests. Now the
advantage of the dict solution is clear:
py> from random import shuffle
py> shuffle(a)
py> min(t1.repeat(number=100000, repeat=6))
6.611802507191896
py> shuffle(a)
py> min(t2.repeat(number=100000, repeat=6))
8.499449279159307
py> shuffle(a)
py> min(t3.repeat(number=100000, repeat=6))
9.61335832811892
The dict solution is hardly changed (slightly faster, although I guess
that is just due to random timing fluctuations and not meaningful). The
other two, on the other hand, are quite a bit slower with unsorted data.
Nevertheless, it should be pointed out that even in the worst case, it's
still pretty damn fast: less than 0.0001s to process a list with 78
items, or about 1.3 microseconds per item.
A couple of other observations:
- The collect() function is the same algorithm as the minmax_groupby()
function, the only difference being the implementation. I would expect
(but haven't tested) that the collect() version using a deque instead of
manually iterating over the items in each group would be faster if there
were a lot of items per group.
- The dict version has an advantage that hashing integers is fast. If
hashing the keys is slower, it may no longer be quite so good.
- The dict version keeps a list of all the values it sees, including
duplicates. Instead of a list, a set may be sufficient:
def minmax_dict(items):
d = collections.defaultdict(set)
for key, value in items:
d[key].add(value)
for key, values in d.items():
yield key, min(values), max(values)
It might even be worthwhile experimenting with something which simply
stores the minimum and maximum values seen, rather than all the values.
If there are a lot of items in each group, or if the comparisons are
expensive, calculating the min and max in a single pass may be worthwhile:
http://code.activestate.com/recipes/577916-fast-minmax-function/
--
Steven
More information about the Python-list
mailing list