[SciPy-User] get best few of many: argsort( few= ) using std::partial_sort ?

Keith Goodman kwgoodman at gmail.com
Tue May 3 12:38:00 EDT 2011


On Thu, Jul 8, 2010 at 10:17 AM, denis <denis-bz-gg at t-online.de> wrote:
> Folks,
>  to get the best few of a large number of objects,
> e.g. vectors near a given one, or small distances in
> spatial.distance.cdist or .pdist,
> argsort( bigArray )[: a few ] is not so hot.  It would be nice if
>    argsort( bigArray, few= )
> did this -- faster, save mem too. Would anyone else find this useful ?
>
> I recently stumbled across partial_sort in stl; fwiw,
> std:partial_sort( A, A + sqrt(N), A + N ) is ~ 10 times faster than
> std:sort
> on my old mac ppc, even for N 100.
> Also fwiw, nth_element alone is ~ twice as slow as partial_sort --
> odd.

I recently added partial sorting to bottleneck (0.5.0dev): partsort()
and argpartsort().

Make an array:

    >> import bottleneck as bn
    >> a = np.random.rand(1000000)

Timing:

    >> timeit np.sort(a)
    10 loops, best of 3: 114 ms per loop
    >> timeit bn.partsort(a, n=1000)
    100 loops, best of 3: 3.95 ms per loop

    >> timeit np.argsort(a)
    10 loops, best of 3: 161 ms per loop
    >> timeit bn.argpartsort(a, n=1000)
    100 loops, best of 3: 11.4 ms per loop

The speed up is more like a factor 2 to 5 for 2d input and n =
a.shape[axis] / 2.

You mentioned memory usage. I can reduce that from my current
implementation by one copy of the input array if it is a problem.

bn.partsort doc string:

    partsort(arr, n, axis=-1)

    Partial sorting of array elements along given axis.

    A partially sorted array is one in which the `n` smallest values appear
    (in any order) in the first `n` elements. The remaining largest elements
    are also unordered. Due to the algorithm used (Wirth's method), the nth
    smallest element is in its sorted position (at index `n-1`).

    Shuffling the input array may change the output. The only guarantee is
    that the first `n` elements will be the `n` smallest and the remaining
    element will appear in the remainder of the output.

    This functions is not protected against NaN. Therefore, you may get
    unexpected results if the input contains NaN.

    Parameters
    ----------
    arr : array_like
        Input array. If `arr` is not an array, a conversion is attempted.
    n : int
        The `n` smallest elements will appear (unordered) in the first `n`
        elements of the output array.
    axis : {int, None}, optional
        Axis along which the partial sort is performed. The default (axis=-1)
        is to sort along the last axis.

    Returns
    -------
    y : ndarray
        A partially sorted copy of the input array where the `n` smallest
        elements will appear (unordered) in the first `n` elements.

    See Also
    --------
    bottleneck.argpartsort: Indices that would partially sort an array

    Notes
    -----
    Unexpected results may occur if the input array contains NaN.

    Examples
    --------
    Create a numpy array:

    >>> a = np.array([1, 0, 3, 4, 2])

    Partially sort array so that the first 3 elements are the smallest 3
    elements (note, as in this example, that the smallest 3 elements may not
    be sorted):

    >>> bn.partsort(a, n=3)
    array([1, 0, 2, 4, 3])



More information about the SciPy-User mailing list