Optimizing list processing

Terry Reedy tjreedy at udel.edu
Wed Dec 11 21:26:51 EST 2013


On 12/11/2013 6:54 PM, Steven D'Aprano wrote:
> I have some code which produces a list from an iterable using at least
> one temporary list, using a Decorate-Sort-Undecorate idiom.

It is a non-standard version thereof, as DSU usually means to decorate 
with a key that gets discarded.

A couple of examples of input and expected output would have been good ;-).

> The algorithm looks something like this (simplified):
>
> table = sorted([(x, i) for i,x in enumerate(iterable)])

This makes two temporaries when only one is needed, and shows why we 
added generator expressions.

table = sorted((x, i) for i,x in enumerate(iterable))
is equivalent to the table; table.sort lines below.

The following (untested) should be faster as it avoids tuple unpacking 
and repacking.

from itertools import count
table = sorted(t for t in zip(iterable, count))

 > table = [i for x,i in table]

Did your original un-simplified use zip instead enumerate? Table now has 
the original index of the items in iterable sorted by the items value. 
The use for this is not obvious.

> The problem here is that for large iterables, say 10 million items or so,
> this is *painfully* slow, as my system has to page memory like mad to fit
> two large lists into memory at once. So I came up with an in-place
> version that saves (approximately) two-thirds of the memory needed.

With 1/3 saved by using a genex, it saves 1/2 of the remainder.

> table = [(x, i) for i,x in enumerate(iterable)]
> table.sorted()

(You meant table.sort().)  This is an expansion of sorted(genex). It 
might be slightly faster as list comp may be faster than list(equivalent 
genex).

> for x, i in table:
>      table[i] = x

I cannot understand what you are aiming for here. Besides the fact that 
this does not work, it keeps x values rather than i indexes as before.

 > table = [i for x,i in table]

done in place is

for j, (x,i) in enumerate(table):
   table[j] = i

I expect the list comp be faster than in-place as long as the result 
list can be allocated and held in memory without paging. (This of course 
depends on system memory and other memory uses.)  List iterators have a 
__length_hint__ method giving the length of the underlying list, so the 
list comp result list can potentially be allocated once and then filled 
in by enumeration and replacement, but in C rather than Python code.

-- 
Terry Jan Reedy




More information about the Python-list mailing list