[Numpy-discussion] Slicing slower than matrix multiplication?

Fri Dec 11 11:36:54 EST 2009

On 12/11/2009 10:03 AM, Francesc Alted wrote:
> A Friday 11 December 2009 16:44:29 Dag Sverre Seljebotn escrigué:
>    
>> Jasper van de Gronde wrote:
>>      
>>> Dag Sverre Seljebotn wrote:
>>>        
>>>> Jasper van de Gronde wrote:
>>>>          
>>>>> I've attached a test file which shows the problem. It also tries adding
>>>>> columns instead of rows (in case the memory layout is playing tricks),
>>>>> but this seems to make no difference. This is the output I got:
>>>>>
>>>>>      Dot product: 5.188786
>>>>>      Add a row: 8.032767
>>>>>      Add a column: 8.070953
>>>>>
>>>>> Any ideas on why adding a row (or column) of a matrix is slower than
>>>>> computing a matrix product with a similarly sized matrix... (Xi has
>>>>> less columns than Xi2, but just as many rows.)
>>>>>            
>>>> I think we need some numbers to put this into context -- how big are the
>>>> vectors/matrices? How many iterations was the loop run? If the vectors
>>>> are small and the loop is run many times, how fast the operation "ought"
>>>> to be is irrelevant as it would drown in Python overhead.
>>>>          
>>> Originally I had attached a Python file demonstrating the problem, but
>>> apparently this wasn't accepted by the list. In any case, the matrices
>>> and vectors weren't too big (60x20), so I tried making them bigger and
>>> indeed the "fast" version was now considerably faster.
>>>        
>> 60x20 is "nothing", so a full matrix multiplication or a single
>> matrix-vector probably takes the same time (that is, the difference
>> between them in itself is likely smaller than the error you make during
>> measuring).
>>
>> In this context, the benchmarks will be completely dominated by the
>> number of Python calls you make (each, especially taking the slice,
>> means allocating Python objects, calling a bunch of functions in C, etc.
>> etc). So it's not that strange, taking a slice isn't free, some Python
>> objects must be created etc. etc.
>>      
> Yeah, I think taking slices here is taking quite a lot of time:
>
> In [58]: timeit E + Xi2[P/2,:]
> 100000 loops, best of 3: 3.95 µs per loop
>
> In [59]: timeit E + Xi2[P/2]
> 100000 loops, best of 3: 2.17 µs per loop
>
> don't know why the additional ',:' in the slice is taking so much time, but my
> guess is that passing&  analyzing the second argument (slice(None,None,None))
> could be the responsible for the slowdown (but that is taking too much time).
> Mmh, perhaps it would be worth to study this more carefully so that an
> optimization could be done in NumPy.
>
>    
>> I think the lesson mostly should be that with so little data,
>> benchmarking becomes a very difficult art.
>>      
> Well, I think it is not difficult, it is just that you are perhaps
> benchmarking Python/NumPy machinery instead ;-)  I'm curious whether Matlab
> can do slicing much more faster than NumPy.  Jasper?
>
>    
What are using actually trying to test here?
I do not see any equivalence in the operations or output here.
-With your slices you need two dot products but ultimately you are only 
using one for your dot product.
-There are addition operations on the slices that are not present in the 
dot product.
-The final E arrays are not the same for all three operations.

Having said that, the more you can vectorize your function, the more 
efficient it will likely be especially with Atlas etc.

Bruce