Testing for performance regressions

Tue Apr 5 03:14:12 EDT 2011

On Mon, Apr 4, 2011 at 10:25 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> On Mon, 04 Apr 2011 20:59:52 -0700, geremy condra wrote:
>
>> On Mon, Apr 4, 2011 at 7:45 PM, Steven D'Aprano
>> <steve+comp.lang.python at pearwood.info> wrote:
>
>>> * The disclaimers about timing code snippets that can be found in the
>>> timeit module apply. If possible, use timeit rather than roll-your-own
>>> timers.
>>
>> Huh. In looking into timing attacks actually one of the biggest lessons
>> I learned was *not* to use timeit- that the overhead and variance
>> involved in using it will wind up consuming small changes in behavior in
>> ways that are fairly opaque until you really take it apart.
>
> Do you have more details?
>
> I follow the advice in the timeit module, and only ever look at the
> minimum value, and never try to calculate a mean or variance. Given the
> number of outside influences ("What do you mean starting up a browser
> with 200 tabs at the same time will affect the timing?"), I wouldn't
> trust a mean or variance to be meaningful.

I think it's safe to treat timeit as an opaque, medium-precision
benchmark with those caveats. If you need actual timing data, though
(answering the question 'how much faster?' rather than 'which is
faster?'), just taking the timings directly seems to provide much, much
better answers. Part of that is because timeit includes the cost of the
for loop in every measurement. Here's the actual code:

def inner(_it, _timer):
    %(setup)s
    _t0 = _timer()
    for _i in _it:
        %(stmt)s
    _t1 = _timer()
    return _t1 - _t0

(taken from Lib/timeit.py line 81)

where %(setup)s and %(stmt)s are what you passed in. Obviously, if the
magnitude of the change you're looking for is smaller than the
variance in the for loop's overhead, this makes things a lot harder
than they need to be, and the whole proposition gets pretty dodgy for
measuring in the sub-millisecond range, which is where many timing
attacks are going to lie. It also has some problems at the opposite
end of the spectrum: timing large, long-running, or memory-intensive
chunks of code can be deceptive because timeit runs with the GC
disabled by default. This bit me a while back working on Graphine,
actually, and it confused the hell out of me at the time.
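
For what it's worth, here is a rough sketch of the two points above,
not a recommendation. It assumes Python 3.3+ for time.perf_counter;
work() and time_each_call() are made-up names for illustration. The
first part takes one timing per call, so no shared for loop sits
inside the timed region. The second part estimates timeit's loop
overhead with an empty statement and turns the GC back on through the
setup string, which the timeit docs suggest.

```python
import time
import timeit


def work():
    # Hypothetical workload, stands in for the code you care about.
    return sum(range(1000))


def time_each_call(func, repeat=1000):
    """Return one elapsed time in seconds per call to func."""
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples


# One sample per call; only func() and the timer calls are timed.
samples = time_each_call(work, repeat=200)

# Rough per-iteration cost of timeit's own loop, from an empty statement.
loops = 100000
loop_overhead = min(timeit.repeat("pass", number=loops, repeat=5)) / loops

# timeit disables the GC by default; re-enable it in setup.
with_gc = timeit.timeit(work, setup="import gc; gc.enable()", number=200)
without_gc = timeit.timeit(work, number=200)
```

For a workload this small the two timeit figures will likely be
indistinguishable; the difference only shows up when the code
allocates enough to trigger collections.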

I'm also not so sure about the 'take the minimum' advice. There's a
reasonable amount of empirical evidence suggesting that timings taken
at around the 30th-40th percentile of the samples are less noisy than
those taken at either end of the distribution, especially if there's a
network performance component. YMMV, of course.
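
To make the comparison concrete, here is a minimal sketch of picking a
low percentile instead of the minimum. The percentile() helper is my
own (a simple nearest-rank-style pick, not a library function), and
the timings are invented numbers purely for illustration, not measured
data.

```python
def percentile(samples, fraction):
    """Pick the sample at the given fraction (0.0-1.0) of the sorted list."""
    ordered = sorted(samples)
    index = int(fraction * (len(ordered) - 1))
    return ordered[index]


# Invented timings in seconds, with two slow outliers.
timings = [1.02, 0.98, 1.00, 5.30, 0.97, 1.01, 0.99, 1.03, 1.00, 2.10]

fastest = min(timings)                   # the 'take the minimum' figure
lower_mid = percentile(timings, 0.35)    # roughly the 30-40% mark
```

With these numbers the minimum is 0.97 and the 35% pick is 1.00. Both
ignore the outliers; which one is steadier from run to run is an
empirical question for your own data.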

Geremy Condra