[Numpy-discussion] Numpy speed ups to simple tasks - final findings and suggestions

Fri Jan 4 18:36:28 EST 2013

On 04/01/2013 2:33 PM, Nathaniel Smith wrote:
> On Fri, Jan 4, 2013 at 6:50 AM, Raul Cota <raul at virtualmaterials.com> wrote:
>> On 02/01/2013 7:56 AM, Nathaniel Smith wrote:
>>> But, it's almost certainly possible to optimize numpy's float64 (and
>>> friends), so that they are themselves (almost) as fast as the native
>>> python objects. And that would help all the code that uses them, not
>>> just the ones where regular python floats could be substituted
>>> instead. Have you tried profiling, say, float64 * float64 to figure
>>> out where the bottlenecks are?
>> Seems to be split between
>> - (primarily) the memory allocation/deallocation of the float64 that is
>> created from the operation float64 * float64. This is why float64
>> * PyFloat was improved by one of my changes: the PyFloat was being
>> internally converted into a float64 before doing the multiplication.
>>
>> - the rest of the time is the actual multiplication pathway.
> Running a quick profile on Linux x86-64 of
>    x = np.float64(5.5)
>    for i in xrange(n):
>       x * x
> I find that ~50% of the total CPU time is inside feclearexcept(), the
> function which resets the floating point error checking registers --
> and most of this is inside a single instruction, stmxcsr ("store sse
> control register").

I find it strange that you don't see a bottleneck in the allocation of a
float64.

Is it easy for you to profile this?

x = np.float64(5.5)
y = 5.5
for i in xrange(n):
     x * y

numpy internally converts y into a temporary float64 and then discards
it. I seem to remember this being a bit over two times slower than x * x.
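
For anyone who wants to repeat the comparison, here is a minimal timing
sketch. It assumes Python 3 and a current NumPy, and uses timeit rather
than a hand-written loop; the helper name time_expr and the iteration
count are made up for illustration. It only measures wall-clock time per
expression and says nothing about where inside numpy the time goes.

```python
import timeit

import numpy as np


def time_expr(stmt, namespace, number=100_000):
    """Best-of-5 wall-clock time (seconds) for `number` runs of stmt."""
    return min(timeit.repeat(stmt, globals=namespace, number=number, repeat=5))


# x is a numpy scalar, y is a native Python float (names as in the post).
ns = {"x": np.float64(5.5), "y": 5.5}

# float64 * float64, float64 * PyFloat, PyFloat * PyFloat
timings = {stmt: time_expr(stmt, ns) for stmt in ("x * x", "x * y", "y * y")}

for stmt, seconds in timings.items():
    print(f"{stmt}: {seconds:.4f} s")
```

The ratios will depend on the numpy version, platform and CPU, so the
numbers from this thread should not be expected to reproduce exactly.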

I will try out your suggestions on

PyUFunc_clearfperr/PyUFunc_getfperror

and see what I get. I haven't gotten around to setting things up for a
pull request for the previous changes yet. If these changes are
worthwhile, would it be OK if I also create one for this?

Thanks again,

Raul

> It's possible that this is different on windows
> (esp. since our fpe exception handling apparently doesn't
> work on windows[1]), but the total time you measure for both
> PyFloat*PyFloat and Float64*Float64 match mine almost exactly, so most
> likely we have similar CPUs that are doing a similar amount of work in
> both cases.
>
> The way we implement floating point error checking is basically:
>      PyUFunc_clearfperr()
>      <do the floating point operation>
>      if (PyUFunc_getfperror() & BAD_STUFF) {
>          <raise a warning or whatever>
>      }
>
> Some points that you may find interesting though:
>
> - The way we define these functions, both PyUFunc_clearfperr() and
> PyUFunc_getfperror() clear the flags. However, for PyUFunc_getfperror,
> this is just pointless. We could simply remove this, and expect to see
> a ~25% speedup in Float64*Float64 without any downside.
>
> - Numpy's default behaviour is to always check for and warn on floating
> point errors. This seems like it's probably the correct default.
> However, if you aren't worried about this for your use case, you could
> disable these warnings with np.seterr(all="ignore"). (And you'll get
> similar error-checking to what PyFloat does.) At the moment, that
> won't speed anything up. But we could easily then fix it so that the
> PyUFunc_clearfperr/PyUFunc_getfperror code checks for whether errors
> are ignored, and disables itself. This together with the previous
> change should get you a ~50% speedup in Float64*Float64, without
> having to change any of numpy's semantics.
>
> - Bizarrely, Numpy still checks the floating point flags on integer
> operations, at least for integer scalars. So 50% of the time in
> Int64*Int64 is also spent in fiddling with floating point exception
> flags. That's also some low-hanging fruit right there... (to be fair,
> this isn't *quite* as trivial to fix as it could be, because the
> integer overflow checking code sets the floating point unit's
> "overflow" flag to signal a problem, and we'd need to pull this out to
> a thread-local variable or something before disabling the floating
> point checks entirely in integer code. But still, not a huge problem.)
>
> -n
>
> [1] https://github.com/numpy/numpy/issues/2350
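
The warning behaviour discussed in the quoted text can be observed from
Python. The following is a minimal sketch, assuming Python 3 and a
current NumPy; the helper multiply_and_count_warnings is hypothetical and
exists only for this illustration. It uses np.errstate, the
context-manager counterpart of np.seterr, so the global settings are left
untouched. It shows the semantics only (whether a warning is emitted on
overflow), not any speed difference.

```python
import warnings

import numpy as np


def multiply_and_count_warnings(a, b, **errstate_kwargs):
    """Compute a * b under the given np.errstate settings.

    Returns the product and the number of warnings emitted.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let filters hide repeats
        with np.errstate(**errstate_kwargs):
            result = a * b
    return result, len(caught)


big = np.float64(1e308)
ten = np.float64(10.0)  # big * ten overflows the float64 range

# Error checking on: the overflow is reported as a RuntimeWarning.
warned_result, warned_count = multiply_and_count_warnings(big, ten, all="warn")

# Error checking ignored: same result, but no warning is raised.
ignored_result, ignored_count = multiply_and_count_warnings(big, ten, all="ignore")

print(warned_result, warned_count)
print(ignored_result, ignored_count)
```

In both cases the product is inf; only the reporting differs.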