Benchmarking some modules - strange result

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Jan 24 23:07:54 EST 2015


Dan Stromberg wrote:

> Hi folks.
> 
> I've been benchmarking some python modules that are mostly variations
> on the same theme.
> 
> For simplicity, let's say I've been running the suite of performance
> tests within a single interpreter - so I test one module thoroughly,
> then move on to the next without exiting the interpreter.

In principle, that could mean that you have more available memory when the
first module runs than when the last one runs. If one module leaves behind
a large data structure, that might make it harder to allocate memory for
the modules that follow.


> I'm finding that if I prune the list of modules down to just the best
> performers, I get pretty different results - what was best no longer
> is.  This strikes me as strange.

Without seeing exactly what is going on, I cannot comment except to say that
on modern computers, timing results are always strange :-)

There are just too many variables to make sense of what is going on. Do the
modules create large data structures which may still be hanging around from
one run to the next? Are you modifying the code in the source file and then
using reload()? Are you re-using instances created before the reload?
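
For example (a minimal sketch, using Python 3's importlib.reload; in
Python 2 reload() is a builtin, and the module and class names here are
hypothetical), instances created before a reload keep running the old
code:

import importlib
import mymodule                      # hypothetical module under test

obj = mymodule.Widget()              # built from the *old* class
importlib.reload(mymodule)           # re-executes the module's source
new_obj = mymodule.Widget()          # built from the *new* class

# Reloading creates a brand new class object, so the old instance's
# __class__ still points at the old one:
print(type(obj) is type(new_obj))    # prints False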

How are you timing your code? Are you timing fast operations or slow
operations? The right way to do things will depend on whether you are
trying to time a fast operation that completes in a few milliseconds, or a
slow one that takes a minute or two.
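
For fast operations, the timeit module is usually the safest tool: it
repeats the call many times, disables the garbage collector while timing,
and lets you take the minimum of several repeats. A minimal sketch (the
function here is just a placeholder for whatever you are measuring):

import timeit

def fast_op():
    return sum(range(100))     # stand-in for the code under test

# Five repeats of 10000 calls each; the minimum is the figure least
# contaminated by whatever else the machine happened to be doing.
times = timeit.repeat(fast_op, repeat=5, number=10000)
print(min(times) / 10000, "seconds per call")

For something that takes a minute or two, a single run timed with
time.time() is usually good enough.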

Is garbage collection enabled or disabled during the tests?

Ideally, you should try to eliminate as much of that complexity as possible.
Set up a single set of tests as a script, then run the script. Make your
changes to the modules, then run the script again.

 
> I'm forcing a garbage collection between tests with gc.collect(), but
> even that yields the oddball results.

Just because you run the gc between tests doesn't necessarily mean it won't
run in the middle of a test too.
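
If that's a concern, one option (sketch only; run_one_test is a
hypothetical stand-in for your test) is to switch the collector off
around the timed section and collect afterwards (timeit does much the
same thing for you):

import gc
import time

gc.collect()        # start from a clean slate
gc.disable()        # no automatic collections during the timing
try:
    start = time.time()
    result = run_one_test()          # hypothetical test function
    elapsed = time.time() - start
finally:
    gc.enable()
    gc.collect()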


> Is there something else I should do to restore the interpreter to a
> known-pristine state for the sake of such tests?

Yes. Kill the interpreter and re-start it.
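
Something along these lines (a rough sketch; the driver module "bench"
and the variant names are hypothetical) runs each module's benchmark in
a freshly started interpreter:

import subprocess
import sys

for name in ["variant_a", "variant_b", "variant_c"]:
    # Each invocation is a brand new interpreter, so nothing left over
    # from a previous module can influence the timing.
    subprocess.check_call([
        sys.executable, "-c",
        "import bench; bench.run(%r)" % name,
    ])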



> BTW, there isn't much else going on on this computer, except some
> possible small cronjobs.

There's *always* something else going on. Here's my computer right now:

[steve@ando ~]$ ps aux | wc -l  # count how many processes are active
272

And running top:

top - 14:45:10 up 85 days, 14:44, 25 users,  load average: 0.08, 0.19, 0.26
Tasks: 270 total,   1 running, 269 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.3%us,  2.7%sy,  0.0%ni, 87.4%id,  3.6%wa,  0.5%hi,  1.5%si,  0.0%st
Mem:   3106896k total,  3001412k used,   105484k free,    61864k buffers
Swap:  4192824k total,  1305304k used,  2887520k free,   408776k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28392 steve     15   0 2407m 2.0g  14m S  7.6 69.1  12:40.48 opera
 3303 root      15   0  478m  23m 2084 S  2.3  0.8   4900:02 Xorg
 2654 root      15   0     0    0    0 S  1.7  0.0  33:51.45 nfsd
  426 root      10  -5     0    0    0 S  1.3  0.0 279:37.70 md3_raid1
 2651 root      16   0     0    0    0 S  0.7  0.0  33:36.65 nfsd
 2652 root      15   0     0    0    0 S  0.7  0.0  33:38.93 nfsd
 3525 steve     15   0 1035m 8532 4864 S  0.7  0.3 374:58.94 swriter.bin
28897 steve     15   0  2444 1144  808 R  0.7  0.0   0:00.13 top
 2467 root      21   0 23220  652  512 S  0.3  0.0   4:00.86 pcscd
 2653 root      15   0     0    0    0 S  0.3  0.0  33:13.85 nfsd
 4251 steve     15   0 36360 7504 4512 S  0.3  0.2   1:00.62 konsole
26841 annie     15   0  2744 1032  916 S  0.3  0.0   0:00.12 imap
    1 root      15   0  2176  412  388 S  0.0  0.0   0:02.65 init
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:01.49 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:18.77 ksoftirqd/0
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:01.19 migration/1
    6 root      34  19     0    0    0 S  0.0  0.0   0:21.81 ksoftirqd/1
    7 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/1
[...]


Even with a load average of 0.08 (practically idling), I have nearly 300
processes. Most of them are sleeping at any given instant, but the OS is
constantly waking them, letting them run for a fraction of a second, then
putting them back to sleep. The above snapshot just happened to catch the
system at a moment when only top itself was active (or rather, top's
output was generated at a moment when only top was active), but that's
just luck. If you sit and watch top for a minute or so, with your hands
off the keyboard and mouse so you aren't changing anything, you will see
the number of running processes go up and down, and even the total number
of processes change.

And remember, top only updates every two seconds or so, so you're only
seeing a fraction of the activity that is going on. Most of the activity
falls in between the snapshots.

 
> I'm about ready to rewrite things to run each individual test in a
> fresh interpreter. But is there a better way?
> 
> If I do the rewrite, I might run 1 run of module A, 1 run of module B,
> 1 run of module C, then another run of module A, another run of module
> B, and another run of module C - to spread any possible timing
> oddities more evenly.

I think that may be counter-productive. It means that any processor cache
issues etc. are more likely to occur, not less. I'm not an expert, but as I
understand it, modern processors have caches to hold frequently used code,
and they are pretty good at predicting which code will be re-used; it just
takes a while for those predictions to get good.

What that means from the point of view of timing Python code is that you
shouldn't do this, since you are defeating the CPU's attempts to predict
commonly used code:

# Time each function running 1000 times, interleaved.
import time

results = {}
for i in range(1000):
    for func in list_of_funcs:       # the callables under test
        name = func.__name__
        t = time.time()
        x = func()                   # one call to the function
        t = time.time() - t          # elapsed seconds for that call
        results[name] = results.get(name, 0) + t   # accumulate total

This results in running code A, B, C, D, A, B, C, D, ... which vastly
reduces the chances of code prediction and CPU caches being helpful. It
makes the times taken (1) longer and (2) more variable. But if you run your
tests A, A, A, ... A, B, B, B, ... B, C, C, C, ... you are working with the
CPU instead of against it, and timings may be less variable and faster.
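
In code, the grouped ordering looks something like this (the same sketch
as above, just with the two loops swapped so each function finishes all
its runs before the next one starts):

# Time each function running 1000 times, grouped by function.
import time

results = {}
for func in list_of_funcs:
    name = func.__name__
    total = 0.0
    for i in range(1000):
        t = time.time()
        x = func()
        total += time.time() - t
    results[name] = total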

As I said, I'm not an expert on modern CPUs, and everything I have just said
may have been utter nonsense :-)



-- 
Steven



