Optimizing list processing

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Dec 12 22:01:37 EST 2013


On Thu, 12 Dec 2013 16:08:33 +0100, Peter Otten wrote:

> Steven D'Aprano wrote:
[...]
>> So, ideally I'd like to write my code like this:
>> 
>> 
>> table = [(x, i) for i,x in enumerate(iterable)]
>> table.sort()
>> if len(table) < ?????:
>>     table = [i for x,i in table]
>> else:
>>     for pos, (x, i) in enumerate(table):
>>         table[pos] = i
>> 
>> where ????? no doubt will depend on how much memory is available in one
>> contiguous chunk.
>> 
>> Is there any way to determine which branch I should run, apart from
>> hard-coding some arbitrary and constant cut-off value?
> 
> How about using two lists?
> 
> keys = list(iterable)
> values = range(len(keys))
> values.sort(key=keys.__getitem__)
> del keys
> 
> The intention is to save memory used for the 2-tuples; I don't know if
> they pop up elsewhere.

They don't. On the other hand, you've now got two lists where before I 
only had one.

In my post, I said that the code I showed was a simplification. What I 
actually have is two branches. If the given iterable is a sequence type 
(say, a list) I use something quite similar to your code above:

    if isinstance(iterable, collections.Sequence):
        getitem = iterable.__getitem__
        if key is not None:
            # Compose the given key with getitem.
            f = lambda i: key(getitem(i))
        else:
            f = getitem
        table = range(len(iterable))  # Python 2.x version
        table.sort(key=f, reverse=reverse)


which avoids any temporary two-tuples, and creates only the one 
additional list. I can't avoid making that list, since it's the result 
required. I don't think that this branch of the code can be optimized in 
any serious way; I can't see any additional fat to be trimmed off it. 
If the input list is so large as to cause thrashing, the only solution is 
to buy more memory.

The other branch handles iterators, and I have two versions. The 
idiomatic version is faster for small data sets (small in this case 
meaning "up to at least one million items on my computer") but ends up 
building at least one extra temporary list which is thrown away:

    else:
        table = [(x, i) for (i, x) in enumerate(iterable)]
        table.sort(key=key, reverse=reverse)
        table = [i for x, i in table]

I've experimented with slight variations, e.g. using sorted() and a 
generator expression instead of a list comp, and the differences are 
trivial. The important thing here is that there's a temporary list which 
in some cases is *very large* but destined to be thrown away. Having two 
very large lists (for large enough input sizes) causes thrashing. 
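
For concreteness, the sorted() variant is roughly this (the exact spelling 
may not be identical to what I timed, but the shape is the same):

    else:
        table = sorted(((x, i) for (i, x) in enumerate(iterable)),
                       key=key, reverse=reverse)
        table = [i for x, i in table]

It still builds the same throw-away list inside sorted(), which is why it 
makes no difference to the memory behaviour.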

To avoid that thrashing, I have a second version which builds only one 
list and then modifies it in place. For very large input sizes (ten 
million items) this is significantly faster:

    else:
        table = [(x, i) for (i, x) in enumerate(iterable)]
        table.sort(key=key, reverse=reverse)
        # Overwrite each (value, index) pair with just the original index.
        for pos, (x, i) in enumerate(table):
            table[pos] = i

but much slower, and less idiomatic, for small input sizes.
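
If anyone wants to see where the crossover falls on their own machine, a 
rough harness along these lines will do; the function names here are just 
placeholders wrapping the two branches above:

    import random
    import timeit

    def undecorate_listcomp(iterable, key=None, reverse=False):
        # Idiomatic version: sort the decorated list, then build a second list.
        table = [(x, i) for (i, x) in enumerate(iterable)]
        table.sort(key=key, reverse=reverse)
        return [i for x, i in table]

    def undecorate_inplace(iterable, key=None, reverse=False):
        # In-place version: overwrite each pair with its original index.
        table = [(x, i) for (i, x) in enumerate(iterable)]
        table.sort(key=key, reverse=reverse)
        for pos, (x, i) in enumerate(table):
            table[pos] = i
        return table

    data = [random.random() for n in range(10**6)]
    for func in (undecorate_listcomp, undecorate_inplace):
        t = timeit.timeit(lambda: func(data), number=3)
        print("%s: %.2f seconds" % (func.__name__, t))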

I don't know of any reasonable way to tell at runtime which of the two 
algorithms I ought to take. Hard-coding an arbitrary value 
("if len(table) > 5000000") is not the worst idea I've ever had, but I'm 
hoping for something more deterministic which can be cheaply calculated 
at runtime.
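
The sort of check I have in mind would look something like the sketch 
below, although I have no idea how reliable it would be in practice. 
(psutil is a third-party module, and the per-item cost is a very rough 
guess.)

    import sys
    try:
        import psutil
    except ImportError:
        psutil = None

    def second_list_fits(table):
        # Rough guess at whether the throw-away list of indices would fit
        # in available RAM without pushing the machine into swapping.
        if psutil is None:
            return len(table) < 5000000  # fall back to the arbitrary cut-off
        # One pointer per item plus the int object itself, approximately.
        per_item = sys.getsizeof(0) + 8
        needed = len(table) * per_item
        return needed < psutil.virtual_memory().available // 2

The idiomatic branch would then run when second_list_fits(table) is true, 
and the in-place loop otherwise.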



-- 
Steven


