efficient partial sort in Python ?

Dan Stromberg drsalists at gmail.com
Tue Aug 19 19:05:51 EDT 2014


On Tue, Aug 19, 2014 at 12:37 PM, Chiu Hsiang Hsu <wdv4758h at gmail.com> wrote:
> On Tuesday, August 19, 2014 5:42:27 AM UTC+8, Dan Stromberg wrote:
>> On Mon, Aug 18, 2014 at 10:18 AM, Chiu Hsiang Hsu <wdv4758h at gmail.com> wrote:
>>
>> > I know that Python uses Timsort as its default sorting algorithm and it is efficient,
>> > but I just want a partial sort (the n largest/smallest elements).
>>
>> Perhaps heapq with PyPy?  Or with nuitka?  Or with numba?

> Another problem with heapq is memory usage: in CPython it costs a lot more memory than sorted (I tested it in 3.4 with 1000000 floats).

This surprises me.  I believe heapq keeps its values in a plain python
list with no extra references, by making node i's left child and right
child be list elements 2*i+1 and 2*i+2, respectively (indices are
zero-based).
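
To make that layout concrete, here is a minimal sketch.  The helper
is_min_heap is a name I made up for illustration, not part of heapq;
it checks the heap invariant using zero-based child indices 2*i+1
and 2*i+2 (the one-based equivalent is 2*i and 2*i+1).  The heap is
just the original list object, rearranged in place:

```python
import heapq
import random

def is_min_heap(a):
    """Hypothetical helper: check the min-heap invariant on a plain list,
    using zero-based child indices 2*i+1 and 2*i+2."""
    n = len(a)
    for i in range(n):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n and a[i] > a[child]:
                return False
    return True

values = [random.random() for _ in range(1000)]
heap = list(values)      # an ordinary list, no node objects
heapq.heapify(heap)      # rearranges the list in place

print(is_min_heap(heap), type(heap).__name__)
```

This says nothing about where the extra memory you measured comes
from; it only shows that the heap itself is a flat list.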

A heap of some sort probably is best algorithmically.  You're probably
just up against a high constant.  On the other hand, there are many
kinds of heaps.

> Out of curiosity, there are many speed-up solutions for Python (like Cython, PyPy). I haven't used Cython before,
> I guess PyPy is a more convenient way to speed up existing Python code (?),
> so how does Cython compare to PyPy? (speed, code, flexibility, or anything else)

PyPy is really fast for CPU-intensive workloads, but CPython is better for I/O.

I tested a single CPU-intensive microbenchmark of Cython and PyPy
(also Jython and CPython).  PyPy was fastest
(http://stromberg.dnsalias.org/~strombrg/backshift/documentation/performance/index.html).

I haven't yet compared numba or nuitka or Shedskin.

When you use heapq, are you putting all the values in the heap, or
just up to n at a time (evicting the worst value, one at a time as you
go)?  If you're doing the former, it's basically a heapsort which
probably won't beat timsort.  If you're doing the latter, that should
be pretty good.
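
Roughly, the latter approach could look like the sketch below.  The
function name n_largest is mine, for illustration only.  It keeps a
min-heap of at most n items, so heap[0] is always the worst value kept
so far and gets evicted when something better arrives:

```python
import heapq

def n_largest(iterable, n):
    """Return the n largest items, largest first, holding at most n
    items in the heap at any time."""
    if n <= 0:
        return []
    heap = []
    for x in iterable:
        if len(heap) < n:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # heap[0] is the smallest value kept so far; evict it.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

data = [5, 1, 9, 3, 7, 2, 8]
print(n_largest(data, 3))
```

That is O(len(data) * log n) time and O(n) extra memory.  The standard
library's heapq.nlargest(n, iterable) and heapq.nsmallest(n, iterable)
do the same job, so in practice I'd try those first.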


