[Numpy-discussion] optimizing power() for complex and real cases

David M. Cooke cookedm at physics.mcmaster.ca
Tue Feb 21 15:43:04 EST 2006


Tim Hochberg <tim.hochberg at cox.net> writes:

> David M. Cooke wrote:
>
>>On Sat, Feb 18, 2006 at 06:17:47PM -0700, Tim Hochberg wrote:
>>
>>
>>>OK, I now have a fairly clean implementation in C of:
>>>
>>>def __pow__(self, p):
>>>   if p is not a scalar:
>>>      return power(self, p)
>>>   elif p == 1:
>>>      return self
>>>   elif p == 2:
>>>      return square(self)
>>>#    elif p == 3:
>>>#       return cube(self)
>>>#    elif p == 4:
>>>#       return power_4(self)
>>>#    elif p == 0:
>>>#       return ones(self.shape, dtype=self.dtype)
>>>#    elif p == -1:
>>>#       return 1.0/self
>>>   elif p == 0.5:
>>>      return sqrt(self)


I've gone through the code you checked in and fixed it up. Looks
good. One side effect is that

def zl(x):
    a = ones_like(x)
    a[:] = 0
    return a

is now faster than zeros_like(x) :-)
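
Here's a rough way to check that claim (the numbers will vary with the
build and the array size, so treat it as illustrative only):

import timeit
import numpy as np

def zl(x):
    a = np.ones_like(x)
    a[:] = 0
    return a

x = np.empty(10000)
print("zeros_like:", timeit.timeit(lambda: np.zeros_like(x), number=10000))
print("zl:        ", timeit.timeit(lambda: zl(x), number=10000))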

One problem I had is that in PyArray_SetNumericOps, the "copy" method
wasn't being picked up. It may be due to the order of initialization of
the ndarray type or something (since "copy" isn't a ufunc, it's
initialized in a different place). I couldn't figure out how to fix
that, so I replaced the x.copy() call with a call to PyArray_Copy().

>>Yes; because it's the implementation of __pow__, the second argument can
>>be anything.
>>
>>
> No, you misunderstand. What I was talking about was that the *first*
> argument can also be something that's not a PyArrayObject, despite the
> function's signature.

Ah, I suppose that's because the power slot in the number protocol
also handles __rpow__.
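
Here's a minimal Python sketch of why that guard is needed (array_power
here is only an illustrative stand-in for the C-level function, not the
actual numpy code):

import numpy as np

def array_power(a, b):
    # The nb_power slot serves both __pow__ and __rpow__, so either
    # operand may arrive as a plain Python scalar.  For example,
    # 2 ** arr reaches the same slot with a == 2 and b == arr.
    if not isinstance(a, np.ndarray):
        a = np.asarray(a)
    return np.power(a, b)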

>>> On the other hand, real powers are fast enough that doing anything
>>> at the single element level is unlikely to help. So in that case
>>> we're left with either optimizing the cases where the dimension is
>>> zero as David has done, or optimizing at the __pow__ (AKA
>>> array_power) level as I've done now based on David's original
>>> suggestion. This second approach is faster because it avoids the
>>> mysterious python scalar -> zero-D array conversion overhead.
>>> However, it suffers if we want to optimize lots of different powers
>>> since one needs a ufunc for each one. So the question becomes,
>>> which powers should we optimize?
>>
>>Hmm, ufuncs are passed a void* argument for passing info to them. Now,
>>that argument is defined when the ufunc is created, but maybe
>>there's a way to piggy-back on it.
>>
>>
> Yeah, I really felt like I was fighting the ufuncs when I was playing
> with this. On the one hand, you really want to use the ufunc
> machinery. On the other hand that forces you into using the same types
> for both arguments. That really wouldn't be a problem, since we could
> just define an integer_power that took doubles, but did integer
> powers, except for the conversion overhead of Python_Integers into
> arrays. It looks like you started down this road and I played with
> this as well. I can think of at least one (horrible) way around
> the matrix overhead, but the real fix would be to dig into
> PyArray_EnsureArray and see why it's slow for Python_Ints. It is much
> faster for numarray scalars.

Right; that needs to be looked at.

> Another approach is to actually compute (x*x)*(x*x) for pow(x,4) at
> the level of array_power. I think I could make this work. It would
> probably work well for medium size arrays, but might well make things
> worse for large arrays that are limited by memory bandwidth since it
> would need to move the array from memory into the cache multiple times.

I don't like that; I think it would be better memory-wise to do it
elementwise: not just for speed, but for the size of the intermediate
arrays.
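
For concreteness, here is a rough Python sketch of that array-level
approach (the function name is made up and the code is illustrative
only): every '*' below is a whole-array multiply, and each one
allocates a full-size temporary, which is where the extra memory
traffic comes from.

import numpy as np

def array_int_power(x, n):
    # x**n for integer n >= 1 via repeated squaring, done with
    # whole-array multiplies rather than elementwise in C.
    assert n >= 1
    result = None
    while n:
        if n & 1:
            result = x if result is None else result * x
        n >>= 1
        if n:
            x = x * x
    return result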

>>> My latest thinking on this is that we should optimize only those
>>> cases where the optimized result is no less accurate than that
>>> produced by pow. I'm going to assume that all C operations are
>>> equivalently accurate, so pow(x,2) has roughly the same amount of
>>> error as x*x. (Something on the order of 0.5 ULP I'd guess). In
>>> that case:
>>>  pow(x, -1) -> 1 / x
>>>  pow(x, 0) -> 1
>>>  pow(x, 0.5) -> sqrt(x)
>>>  pow(x, 1) -> x
>>>  pow(x, 2) -> x*x
>>> can all be implemented in terms of multiply or divide with the same
>>> accuracy as the original power methods. Once we get beyond these,
>>> the error will go up progressively.
>>>
>>> The minimal set described above seems like it should be relatively
>>> uncontroversial and it's what I favor. Once we get beyond this
>>> basic set, we would need to reach some sort of consensus on how
>>> much additional error we are willing to tolerate for optimizing
>>> these extra cases. You'll notice that I've changed my mind, yet
>>> again, over whether to optimize A**0.5. Since the set of additional
>>> ufuncs needed in this case is relatively small, just square and
>>> inverse (==1/x), this minimal set works well if optimizing in pow
>>> as I've done.
>>>
>>>
>
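
Expressed in Python, that minimal set amounts to something like the
following sketch (pow_minimal is just an illustrative name; the real
dispatch would live in array_power at the C level):

import numpy as np

def pow_minimal(x, p):
    # Each branch is no less accurate than pow() itself.
    if p == -1:
        return 1.0 / x
    if p == 0:
        return np.ones_like(x)
    if p == 0.5:
        return np.sqrt(x)
    if p == 1:
        return x
    if p == 2:
        return x * x
    return np.power(x, p)       # everything else falls back to pow()
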
> Just to add a little more confusion to the mix. I did a little testing
> to see how close pow(x,n) and x*x*... actually are. They are slightly
> less close for small values of N and slightly closer for large values
> of N than I would have expected. The upshot of this is that integer
> powers between  -2 and +4 all seem to vary by the same amount when
> computed using pow(x,n) versus multiplies. I'm including the test code
> at the end. Assuming that this result is not a fluke, that expands the
> noncontroversial set by at least 3 more values. That's starting to
> strain the ufunc approach, so perhaps optimizing in @TYP@_power is the
> way to go after all. Or, more likely, adding @TYP@_int_power or maybe
> @TYP@_fast_power (so as to be able to include some half-integer
> powers) and dispatching appropriately from array_power.

'int_power' we could do; that would be the next step, I think. The
half-integer powers we could maybe leave; if you want x**(-3/2), for
instance, you could do y = x**(-2)*sqrt(x) (or do y = x**(-3);
sqrt(y,y) if you're worried about temporaries).
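
For example, with plain numpy (nothing new needed):

import numpy as np

x = np.linspace(1.0, 4.0, 5)

# x**(-3/2) built from cheaper pieces:
y = x**(-2) * np.sqrt(x)

# or, saving a temporary by doing the sqrt in place:
y = x**(-3)
np.sqrt(y, y)            # y now holds x**(-1.5)

assert np.allclose(y, x**-1.5)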

Or, 'fast_power' could be documented as doing the optimizations for
integer and half-integer _scalar_ exponents, up to a certain size
(like 100), and falling back on pow() if necessary. I think we could do
a precomputation step to split the exponent into appropriate squarings
and such that will make the elementwise loop faster. Half-integer
exponents are exactly representable as doubles (up to some number, of
course), so there's no chance of decimal-to-binary conversions making
things look different. That might work out ok. Although, at that point
I'd suggest we make it 'power', and have 'rawpower' (or ????) as the
version that just uses pow().
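
A rough Python sketch of that precomputation idea (the name, the cutoff
of 100, and the whole-array multiplies are all illustrative; a real
version would do the per-element work inside a ufunc inner loop):

import numpy as np

def fast_power(x, p):
    # Scalar integer and half-integer exponents up to the cutoff are
    # built from multiplies and at most one sqrt; everything else
    # falls back to pow().
    m = 2.0 * p
    if m != round(m) or abs(p) > 100:
        return np.power(x, p)
    m = int(round(m))                       # exponent in units of 1/2
    base = np.sqrt(x) if m % 2 else x       # odd m needs one sqrt factor
    n = abs(m) if m % 2 else abs(m) // 2    # remaining whole-number exponent
    result = np.ones_like(base)
    while n:                                # repeated squaring
        if n & 1:
            result = result * base
        n >>= 1
        if n:
            base = base * base
    return 1.0 / result if p < 0 else result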

Another point is to look at __div__, and use reciprocal if the
dividend is 1.
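
Along the same lines, a sketch (array_div is an illustrative stand-in
for the slot implementation, not actual numpy code):

import numpy as np

def array_div(a, b):
    # If the dividend is the Python scalar 1 (i.e. the expression is
    # 1/x), dispatch to the reciprocal ufunc instead of a full divide.
    # (Assumes a floating-point b; integer arrays would need care.)
    if isinstance(a, (int, float)) and a == 1:
        return np.reciprocal(b)
    return np.divide(a, b)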

> The problem here, of course, is the overhead that PyArray_EnsureArray
> runs into. I'm not sure if the ufuncs actually call that, but I was
> using that to convert things to arrays at one point and I saw the
> slowdown, so I suspect that the slowdown is in something
> PyArray_EnsureArray calls if not in that routine itself. I'm afraid to
> dig into that stuff, though. On the other hand, it would probably
> speed up all kinds of stuff if that were sped up.

I've added a page to the developer's wiki at
http://projects.scipy.org/scipy/numpy/wiki/PossibleOptimizationAreas
to keep a list of areas like that to look into if someone has time :-)

-- 
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke                      http://arbutus.physics.mcmaster.ca/dmc/
|cookedm at physics.mcmaster.ca



