Selecting k smallest or largest elements from a large list in python; (benchmarking)

Arnaud Delobelle arnodel at googlemail.com
Thu Sep 2 02:43:27 EDT 2010


Dmitry Chichkov <dchichkov at gmail.com> writes:

> Given: a large list (10,000,000) of floating point numbers;
> Task: fastest python code that finds k (small, e.g. 10) smallest
> items, preferably with item indexes;
> Limitations: in python, using only standard libraries (numpy & scipy
> is Ok);
>
> I've tried several methods. With N = 10,000,000, K = 10 The fastest so
> far (without item indexes) was pure python implementation
> nsmallest_slott_bisect (using bisect/insert). And with indexes
> nargsmallest_numpy_argmin (argmin() in the numpy array k times).
>
> Anyone up to the challenge beating my code with some clever selection
> algorithm?
>
> Current Table:
> 1.66864395142 mins_heapq(items, n):
> 0.946580886841 nsmallest_slott_bisect(items, n):
> 1.38014793396 nargsmallest(items, n):
> 10.0732769966 sorted(items)[:n]:
> 3.17916202545 nargsmallest_numpy_argsort(items, n):
> 1.31794500351 nargsmallest_numpy_argmin(items, n):
> 2.37499308586 nargsmallest_numpy_array_argsort(items, n):
> 0.524670124054 nargsmallest_numpy_array_argmin(items, n):
>
> 0.0525538921356 numpy argmin(items): 1892997
> 0.364673852921 min(items): 10.0000026786

I think without numpy, nsmallest_slott_bisect is almost optimal.  There
is a slight improvement:

1.33862709999 nsmallest_slott_bisect(items, n): [10.000011643188717, 10.000017791492528]
0.883894920349 nsmallest_slott_bisect2(items, n): [10.000011643188717, 10.000017791492528]

==== code ====

from bisect import insort
from itertools import islice

def nsmallest_slott_bisect(iterable, n, insort=insort):
    it   = iter(iterable)
    mins = sorted(islice(it, n))
    for el in it:
        if el <= mins[-1]: #NOTE: equal sign is to preserve duplicates
            insort(mins, el)
            mins.pop()
    return mins

def nsmallest_slott_bisect2(iterable, n, insort=insort):
    it   = iter(iterable)
    mins = sorted(islice(it, n))
    maxmin = mins[-1]
    for el in it:
        if el <= maxmin: #NOTE: equal sign is to preserve duplicates
            insort(mins, el)
            mins.pop()
            maxmin = mins[-1]
    return mins

import time
from random import randint, random

test_data = [randint(10, 50) + random() for i in range(10000000)]
K = 10

init = time.time()
mins = nsmallest_slott_bisect(test_data, K)
print time.time() - init, 'nsmallest_slott_bisect(items, n):', mins[:
2]

init = time.time()
mins = nsmallest_slott_bisect2(test_data, K)
print time.time() - init, 'nsmallest_slott_bisect2(items, n):', mins[:
2]

-- 
Arnaud



More information about the Python-list mailing list