[Numpy-discussion] -ffast-math

Sun Dec 1 18:01:31 EST 2013

Julian Taylor <jtaylor.debian <at> googlemail.com> writes:

> 
> On 01.12.2013 22:59, Dan Goodman wrote:
> > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> >> Your sin and exp calls are loop invariants; they do not depend on the
> >> loop variable.
> >> This allows you to move the expensive functions out of the loop and
> >> leave only some simple arithmetic in the body.
> > 
> > Ahhhh! I feel extremely stupid for not realising this! Thanks Julian.
> > 
> > Any thoughts on why it actually goes slower with -ffast-math when just
> > doing sin(x)?
> > 
> 
> No, on my Linux machine -ffast-math is a little faster:
> numpy: 311 ms
> weave_slow: 291 ms
> weave_fast: 262 ms

Maybe something to do with my older version of gcc (4.5)?

> Here is a pure NumPy version of your calculation which is only 3 times
> slower than weave:
> 
> def timefunc_numpy2(a, v):
>     ext = exp(-dt/tau)
>     sit = sin(2.0*freq*pi*t)
>     bs = 20000
>     for i in range(0, N, bs):
>         ab = a[i:i+bs]
>         vb = v[i:i+bs]
>         absit = ab*sit + b
>         vb *= ext
>         vb += absit
>         vb -= absit*ext
> 
> It works by replacing temporaries with in-place operations and by blocking
> the operations to be more cache friendly.
> Using numexpr should give you similar results.
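
For anyone who wants to try the blocked, in-place approach from the quoted function outside the original script, here is a minimal self-contained sketch. The parameter values (N, dt, tau, freq, t, b) and the random inputs are assumptions chosen only for illustration; they are not the values used in the timings above. The helper update_with_temporaries is a hypothetical reference that writes out the same arithmetic as the blocked loop in one go.

```python
import numpy as np

def update_with_temporaries(a, v, sit, ext, b):
    # Reference form: every operation allocates a full-size temporary.
    absit = a * sit + b
    return v * ext + absit - absit * ext

def update_blocked_inplace(a, v, sit, ext, b, bs=20000):
    # Same arithmetic, applied one block at a time. The slices are views,
    # so the in-place operators write directly into v.
    for i in range(0, len(v), bs):
        ab = a[i:i + bs]
        vb = v[i:i + bs]
        absit = ab * sit + b   # only a block-sized temporary
        vb *= ext
        vb += absit
        vb -= absit * ext
    return v

# Hypothetical parameters, for illustration only.
N = 100_000
dt, tau, freq, t, b = 0.1, 10.0, 1.3, 0.37, 1.2

# Loop invariants, computed once as scalars.
ext = np.exp(-dt / tau)
sit = np.sin(2.0 * freq * np.pi * t)

rng = np.random.default_rng(0)
a = rng.random(N)
v0 = rng.random(N)

expected = update_with_temporaries(a, v0, sit, ext, b)
result = update_blocked_inplace(a, v0.copy(), sit, ext, b)
agree = np.allclose(result, expected)
```

The block size bs=20000 is taken from the quoted code; whether blocking helps at all will depend on the machine's cache sizes, so it is worth timing several values.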

I was working on something similar without the blocking and also got good
results. Actually, your version with blocking doesn't perform as well on my
machine: it's around 6x slower than weave. I tried different block sizes but
couldn't improve much on that. Using this unblocked code:

def timefunc_numpy_smart():
    _sin_term = sin(2.0*freq*pi*t)
    _exp_term = exp(-dt/tau)
    _a_term = (_sin_term-_sin_term*_exp_term)
    _v = v
    _v *= _exp_term
    _v += a*_a_term
    _v += -b*_exp_term + b

I got around 5x slower than weave. Using numexpr 'dumbly' (i.e. just putting
the expression in directly) was slower than the function above, but a hybrid
of the two approaches worked well:

def timefunc_numexpr_smart():
    _sin_term = sin(2.0*freq*pi*t)
    _exp_term = exp(-dt/tau)
    _a_term = (_sin_term-_sin_term*_exp_term)
    _const_term = -b*_exp_term + b
    v[:] = numexpr.evaluate('a*_a_term+v*_exp_term+_const_term')
    #numexpr.evaluate('a*_a_term+v*_exp_term+_const_term', out=v)

This was about 3.5x slower than weave. If I use the commented-out final
line instead, it is only 1.5x slower than weave, but it also gives wrong
results. I reported this as a bug in numexpr a long time ago, but I guess it
hasn't been fixed yet (or maybe I just haven't upgraded my version recently).
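
A quick way to check whether a given numexpr installation is affected is to compare the two variants directly. This is only a sketch: the scalar values and array sizes are made up, the function name check_out_variant is hypothetical, and the variables are passed through local_dict explicitly rather than picked up from the calling frame. It returns None if numexpr is not installed.

```python
import numpy as np

try:
    import numexpr
except ImportError:  # numexpr is optional for this sketch
    numexpr = None

def check_out_variant(n=100_000, seed=0):
    """Return True if out=v matches v[:] = ..., None if numexpr is missing."""
    if numexpr is None:
        return None
    rng = np.random.default_rng(seed)
    a = rng.random(n)
    v0 = rng.random(n)
    # Hypothetical scalar terms, for illustration only.
    scalars = {'_a_term': 0.3, '_exp_term': 0.99, '_const_term': 0.01}
    expr = 'a*_a_term+v*_exp_term+_const_term'

    # Variant 1: evaluate into a fresh array, then copy into v.
    v = v0.copy()
    v[:] = numexpr.evaluate(expr, local_dict=dict(scalars, a=a, v=v))
    expected = v

    # Variant 2: write straight into v through the out argument.
    v = v0.copy()
    numexpr.evaluate(expr, local_dict=dict(scalars, a=a, v=v), out=v)
    return bool(np.allclose(v, expected))

same = check_out_variant()
```

If same comes back False, the out= variant is not safe on that installation and the slower v[:] = ... form is the one to keep.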

Dan