[SciPy-User] get best few of many: argsort( few= ) using std::partial_sort ?
Keith Goodman
kwgoodman at gmail.com
Tue May 3 12:38:00 EDT 2011
On Thu, Jul 8, 2010 at 10:17 AM, denis <denis-bz-gg at t-online.de> wrote:
> Folks,
> to get the best few of a large number of objects,
> e.g. vectors near a given one, or small distances in
> spatial.distance.cdist or .pdist,
> argsort( bigArray )[: a few ] is not so hot. It would be nice if
> argsort( bigArray, few= )
> did this -- faster, save mem too. Would anyone else find this useful ?
>
> I recently stumbled across partial_sort in stl; fwiw,
> std:partial_sort( A, A + sqrt(N), A + N ) is ~ 10 times faster than
> std:sort
> on my old mac ppc, even for N 100.
> Also fwiw, nth_element alone is ~ twice as slow as partial_sort --
> odd.
I recently added partial sorting to bottleneck (0.5.0dev): partsort()
and argpartsort().
Make an array:
>> import bottleneck as bn
>> a = np.random.rand(1000000)
Timing:
>> timeit np.sort(a)
10 loops, best of 3: 114 ms per loop
>> timeit bn.partsort(a, n=1000)
100 loops, best of 3: 3.95 ms per loop
>> timeit np.argsort(a)
10 loops, best of 3: 161 ms per loop
>> timeit bn.argpartsort(a, n=1000)
100 loops, best of 3: 11.4 ms per loop
The speed up is more like a factor 2 to 5 for 2d input and n =
a.shape[axis] / 2.
You mentioned memory usage. I can reduce that from my current
implementation by one copy of the input array if it is a problem.
bn.partsort doc string:
partsort(arr, n, axis=-1)
Partial sorting of array elements along given axis.
A partially sorted array is one in which the `n` smallest values appear
(in any order) in the first `n` elements. The remaining largest elements
are also unordered. Due to the algorithm used (Wirth's method), the nth
smallest element is in its sorted position (at index `n-1`).
Shuffling the input array may change the output. The only guarantee is
that the first `n` elements will be the `n` smallest and the remaining
element will appear in the remainder of the output.
This functions is not protected against NaN. Therefore, you may get
unexpected results if the input contains NaN.
Parameters
----------
arr : array_like
Input array. If `arr` is not an array, a conversion is attempted.
n : int
The `n` smallest elements will appear (unordered) in the first `n`
elements of the output array.
axis : {int, None}, optional
Axis along which the partial sort is performed. The default (axis=-1)
is to sort along the last axis.
Returns
-------
y : ndarray
A partially sorted copy of the input array where the `n` smallest
elements will appear (unordered) in the first `n` elements.
See Also
--------
bottleneck.argpartsort: Indices that would partially sort an array
Notes
-----
Unexpected results may occur if the input array contains NaN.
Examples
--------
Create a numpy array:
>>> a = np.array([1, 0, 3, 4, 2])
Partially sort array so that the first 3 elements are the smallest 3
elements (note, as in this example, that the smallest 3 elements may not
be sorted):
>>> bn.partsort(a, n=3)
array([1, 0, 2, 4, 3])
More information about the SciPy-User
mailing list