Optimizing list processing

Terry Reedy tjreedy at udel.edu
Thu Dec 12 13:40:36 EST 2013


On 12/12/2013 7:08 AM, Steven D'Aprano wrote:

> Please don't focus on the algorithm I gave. Focus on the fact that I
> could write it like this:
>
> if some condition to do with the computer's available memory:
>       make modifications in place
> else:
>       make a copy of the data containing the modifications
>
> if only I knew what that condition was and how to test for it.

In other words, you want a magic formula that depends on just about 
everything: the (static) memory in the machine, the other programs 
running, the memory otherwise used by the current program, the number 
of items, the nature of the items, and possibly the memory 
fragmentation state.
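
Even a best-effort version of such a condition ends up being a rough 
guess that ignores most of the factors above. A hypothetical sketch, 
using the third-party psutil package (not in the stdlib) plus 
sys.getsizeof, just to show how approximate it is:

import sys
import psutil  # third-party; assumed available for this sketch

def copy_probably_fits(items):
    # Very rough: the list object plus its item objects, compared
    # against currently-available memory. Ignores object sharing,
    # fragmentation, and whatever else allocates memory later.
    needed = sys.getsizeof(items) + sum(sys.getsizeof(x) for x in items)
    return needed < psutil.virtual_memory().available // 2

The numbers such a check gives shift with everything else running on 
the machine, which is exactly the problem.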

> You may be right in this particular case, I don't know I haven't tried
> it, but there are cases where list comps are significantly faster than
> generator expressions.

Stefan noted that neither is needed if zip produces the items wanted.
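
For illustration only (the original algorithm is not reproduced here), 
the point is that zip already yields the decorated pairs lazily, so 
nothing extra has to be built around it:

keys = [2, 3, 1]        # hypothetical sort keys
data = ['b', 'c', 'a']  # hypothetical values

# zip produces the (key, value) pairs directly; no list comprehension
# or generator expression is needed to decorate before sorting.
decorated = sorted(zip(keys, data))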

> I'm not terribly fussed about micro-optimizations here. I'm concerned
> about *macro-optimizing* the case where creating two (or more) lists
> forces the computer to page memory in and out of virtual memory. Saving a
> few milliseconds? I don't care. Saving 5 or 6 seconds? That I care about.

Why would you spend an hour or more of your time to save 5 seconds? 
Theoretical interest, or a practical problem with enough runs to turn 
seconds into hours or days? A non-trivial practical answer will require 
more info about the practical context, including the distribution of 
machine memory sizes and of problem sizes.

> I don't want to pre-allocate the list comp. I want to avoid having to
> have two LARGE lists in memory at the same time, one containing the
> decorated values, and one not. I know that this is a good optimization
> for large enough lists. On my computer, ten million items is sufficient
> to demonstrate the optimization, and with sufficient testing I could
> determine a rough cut-off value, below which list comps are more
> efficient and above which in-place modifications are better.
>
> But I don't know how to generalize that cut-off value. If I buy extra
> RAM, the cut-off value will shift. If I run it on another computer, it
> will shift.

The simplest answer is to avoid catastrophe by always modifying the 
one list in place. For 'short' lists, the time difference will be 
relatively small.
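
A minimal sketch of what that looks like, assuming the work is a 
simple per-item transformation (transform here is only a placeholder):

def transform(item):
    # placeholder for whatever per-item work is actually needed
    return item

def modify_in_place(items):
    # overwrite each slot rather than building a second large list
    for i, item in enumerate(items):
        items[i] = transform(item)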

If short lists are frequent enough that the cumulative time difference 
matters, and the list object sizes are more or less constant, or at 
least bounded, a possible answer is to pick a minimum memory size, get 
a machine with that much memory, and work out a list size small enough 
to avoid thrashing.
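
Under those assumptions the cut-off becomes an empirically chosen 
constant for a known machine. A hypothetical sketch, reusing the 
transform placeholder above (the CUTOFF value would come only from 
timing runs on the target machine):

CUTOFF = 10**7  # assumed value; determined by local timing experiments

def process(items):
    if len(items) < CUTOFF:
        # small enough that a second list is harmless and a bit faster
        return [transform(item) for item in items]
    # large enough to risk thrashing: rewrite in place instead
    for i, item in enumerate(items):
        items[i] = transform(item)
    return items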

-- 
Terry Jan Reedy
