Optimizing list processing

Thu Dec 12 07:08:58 EST 2013

On Wed, 11 Dec 2013 21:26:51 -0500, Terry Reedy wrote:

> On 12/11/2013 6:54 PM, Steven D'Aprano wrote:
>> I have some code which produces a list from an iterable using at least
>> one temporary list, using a Decorate-Sort-Undecorate idiom.
> 
> It is a non-standard version thereof, as DSU usually means to decorate
> with a key that gets discarded.

I do decorate with a key that gets discarded. The actual values that are 
being sorted are the enumerated values 0, 1, 2, ... and the temporary key 
values come from the iterable.

Confused yet? :-)

> A couple of examples of input and expected output would have been good
> ;-).

Only if I wanted you to understand the algorithm, which isn't really 
important. My question is about swapping between in-place modifications 
and external copying. If I wanted to be really mysterious (annoying), I 
could have just give pseudo-code :-)

Please don't focus on the algorithm I gave. Focus on the fact that I 
could write it like this:

if some condition to do with the computer's available memory:
     make modifications in place
else:
     make a copy of the data containing the modifications

if only I knew what that condition was and how to test for it.

>> The algorithm looks something like this (simplified):
>>
>> table = sorted([(x, i) for i,x in enumerate(iterable)])
> 
> This makes two temporaries when only one is needed, and shows why we
> added generator expressions.

You may be right in this particular case, I don't know I haven't tried 
it, but there are cases where list comps are significantly faster than 
generator expressions. For example, when passed to str.join, list comps 
are (on my computer) consistently 15% faster:

[steve at ando ~]$ python3.3 -m timeit "''.join([chr(n) for n in range(1, 
201)])"
10000 loops, best of 3: 100 usec per loop
[steve at ando ~]$ python3.3 -m timeit "''.join([chr(n) for n in range(1, 
201)])"
10000 loops, best of 3: 94.1 usec per loop
[steve at ando ~]$ python3.3 -m timeit "''.join(chr(n) for n in range(1, 
201))"
10000 loops, best of 3: 116 usec per loop
[steve at ando ~]$ python3.3 -m timeit "''.join(chr(n) for n in range(1, 
201))"
10000 loops, best of 3: 109 usec per loop

> table = sorted((x, i) for i,x in enumerate(iterable)) is equivalent to
> the table; table.sort lines below.
> 
> The following (untested) should be faster as it avoids tuple unpacking
> and repacking.

I'm not terribly fussed about micro-optimizations here. I'm concerned 
about *macro-optimizing* the case where creating two (or more) lists 
forces the computer to page memory in and out of virtual memory. Saving a 
few milliseconds? I don't care. Saving 5 or 6 seconds? That I care about.

[...]
> I expect the list comp be faster than in-place as long as the result
> list can be allocated and held in memory without paging. (This of course
> depends on system memory and other memory uses.)  

Exactly!

> List iterators have a
> __length_hint__ method giving the length of the underlying list, so the
> list comp result list can potentially be allocated once and then filled
> in by enumeration and replacement, but in C rather than Python code.

I don't want to pre-allocate the list comp. I want to avoid having to 
have two LARGE lists in memory at the same time, one containing the 
decorated values, and one not. I know that this is a good optimization 
for large enough lists. On my computer, ten million items is sufficient 
to demonstrate the optimization, and with sufficient testing I could 
determine a rough cut-off value, below which list comps are more 
efficient and above which in-place modifications are better.

But I don't know how to generalize that cut-off value. If I buy extra 
RAM, the cut-off value will shift. If I run it on another computer, it 
will shift.

P.S. The algorithm I'm working on is a way of generating index and rank 
tables. Not that it really matters -- what matters is determining whether 
or not to shift from "make a copy of the list" to "modify the list in 
place".

-- 
Steven