min max from tuples in list

Thu Dec 12 21:36:56 EST 2013

On Thu, 12 Dec 2013 13:54:10 +0100, Peter Otten wrote:

> Steven D'Aprano wrote:
> 
>> In any case, sorting in Python is amazingly fast. You may be pleasantly
>> surprised that a version that sorts your data, while nominally O(N log
>> N), may be much faster than an O(N) solution that doesn't require
>> sorted data. If I were a betting man, I'd be willing to wager a shiny
>> new dollar[1] that sorting works out faster for reasonable sized sets
>> of data.
> 
> Well, that was my first reaction, too. But then

[snip timeit results]

On my system, I don't get quite the same results as you. I find that the 
groupby solution is slightly faster than the dict solution, although both 
are slightly faster than the version I came up with. All three are plenty 
fast enough though. Using the same function definitions as you, I get:

py> from timeit import Timer
py> setup = "from __main__ import time_dict, time_groupby, time_daprano"
py> t1 = Timer("_ = time_dict()", setup)
py> t2 = Timer("_ = time_groupby()", setup)
py> t3 = Timer("_ = time_daprano()", setup)

and results:

py> min(t1.repeat(number=100000, repeat=6))
6.7522243447601795
py> min(t2.repeat(number=100000, repeat=6))
6.7246054373681545
py> min(t3.repeat(number=100000, repeat=6))
7.116707382723689

But, there's a flaw in the above analysis. The list that you give is pre-
sorted, and Python's sort routine is *very* efficient at sorting already 
sorted lists. So I randomly shuffled it and redid the tests. Now the 
advantage of the dict solution is clear:

py> from random import shuffle
py> shuffle(a)
py> min(t1.repeat(number=100000, repeat=6))
6.611802507191896
py> shuffle(a)
py> min(t2.repeat(number=100000, repeat=6))
8.499449279159307
py> shuffle(a)
py> min(t3.repeat(number=100000, repeat=6))
9.61335832811892

The dict solution is hardly changed (slightly faster, although I guess 
that is just due to random timing fluctuations and not meaningful). The 
other two, on the other hand, are quite a bit slower with unsorted data.

Nevertheless, it should be pointed out that even in the worst case, it's 
still pretty damn fast: less than 0.0001s to process a list with 78 
items, or about 1.3 microseconds per item.

A couple of other observations:

- The collect() function is the same algorithm as the minmax_groupby() 
function, the only difference being the implementation. I would expect 
(but haven't tested) that the collect() version using a deque instead of 
manually iterating over the items in each group would be faster if there 
were a lot of items per group.

- The dict version has an advantage that hashing integers is fast. If 
hashing the keys is slower, it may no longer be quite so good.

- The dict version keeps a list of all the values it sees, including 
duplicates. Instead of a list, a set may be sufficient:

def minmax_dict(items):
    d = collections.defaultdict(set)
    for key, value in items:
        d[key].add(value)
    for key, values in d.items():
        yield key, min(values), max(values)

It might even be worthwhile experimenting with something which simply 
stores the minimum and maximum values seen, rather than all the values. 
If there are a lot of items in each group, or if the comparisons are 
expensive, calculating the min and max in a single pass may be worthwhile:

http://code.activestate.com/recipes/577916-fast-minmax-function/

-- 
Steven