[Numpy-discussion] Numpy speed ups to simple tasks - final findings and suggestions

Raul Cota raul at virtualmaterials.com
Fri Dec 21 14:20:58 EST 2012


Hello,


On Dec/2/2012 I sent an email about some significant speed problems I was 
facing when porting our core program from Numeric (Python 2.2) to Numpy 
(Python 2.6). Some of our tests went from 30 seconds to 90 seconds, for 
example.

I saw interest from some people on this list, and I left the topic saying 
I would do a more complete profile of the program and report back anything 
meaningful. It took me quite a while to get through everything, because I 
first had to figure out how to create a Visual Studio project that I could 
compile and debug from the IDE.


First, the obvious:
Everything that relies heavily on Numpy for speed (mid-size to large 
arrays) is pretty much the same speed as Numeric. The areas that are 
considerably slower in Numpy vs. Numeric are the trivial tasks that we 
end up using either for convenience (small arrays) or because scalar 
types such as 'float64' propagate everywhere throughout the program and 
creep into several of our data structures.

This email is really only relevant to people who are stuck doing trivial 
operations with Numpy and want a meaningful speed boost. I focused on 
float64.
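As a concrete illustration of the propagation problem (a minimal sketch, 
not part of our code): indexing a float64 array hands back a Numpy 
scalar, and ordinary arithmetic with it keeps producing Numpy scalars, 
which is how the type creeps into other data structures.

```python
import numpy as np

vector1 = np.array([2.0, 3.1])

# Indexing a float64 array returns a numpy.float64 scalar,
# not a plain Python float...
x = vector1[1]
print(type(x))             # <class 'numpy.float64'>

# ...and arithmetic with that scalar keeps producing numpy.float64,
# so the type silently spreads through downstream code.
y = 2.0 * x + 1.0
print(type(y))             # <class 'numpy.float64'>
```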

* NOTE: I did everything in Numpy 1.6.2 as opposed to the latest code. I 
am going to guess all my findings still apply, but I will only be able to 
confirm that later.

=========================================================

In this email I include,
1) Main bottlenecks I found which I list and refer to as (a), (b) and (c).
2) The benchmark tests I made and their speed ups
3) Details on the actual changes to the C code

=========================================================

Summary of conclusions,
- Our code is finally running as fast as it used to, thanks to some 
changes in Numpy and also some minor changes in our code. Half of our 
problems were caused by instantiating small arrays many times, which is 
fairly slow in Numpy. The other half were caused by the slow math 
performance of Numpy scalars. We did find one particular Python function 
in our code that was a strong candidate for rewriting in C, and got that 
done.
- In Numpy I made four sets of changes to the source code. I believe 
three of them are relevant to everyone using Numpy; one of them is 
probably not going to be very popular.
- The main speed-up is in float64 scalar operations and the creation of 
small arrays from lists or tuples. The speed-up in small-array operations 
is only marginal, but I believe there is potential to make them at least 
twice as fast.


=========================================================
1)

By profiling the program I found three generic types of bottlenecks in 
Numpy that were affecting our code,

a) Operations that result in Python internally raising an error
e.g.
PyObject_GetAttrString(obj, "__array_priority__")
when __array_priority__ is not an attribute of obj

b) Creation/destruction of scalar array types. In some places this was 
happening unnecessarily.

c) Ufuncs trying to figure out the proper type for an operation (e.g. if 
I multiply a float64 array by a float64 array, a fair amount of time is 
spent deciding that it should use float64)

I came up with specific changes to address (a) and (b). I gave up on (c) 
for now, since I couldn't think of a way to speed it up without a large 
rewrite, and I really don't know the Numpy code (I had never seen it 
before this work).

=========================================================
2)

The tests I did were (some are Python natives for reference):
1) Array * Array
2) PyFloat * Array
3) Float64 * Array
4) PyFloat + Array
5) Float64 + Array
6) PyFloat * PyFloat
7) Float64 * Float64
8) PyFloat * Float64
9) PyFloat * vector1[1]
10) PyFloat + Float64
11) PyFloat < Float64
12) if PyFloat < Float64:
13) Create array from list
14) Assign PyFloat to all
15) Assign Float64 to all
16) Float64 * PyFloat * PyFloat * PyFloat * PyFloat
17) Float64 * Float64 * Float64 * Float64 * Float64
18) Float64 ** 2
19) PyFloat ** 2


where
Array -> Numpy array of float64 of two elements (vector1 = array( [2.0, 
3.1] )).
PyFloat -> pyFloat = 3.1
Float64 -> Numpy scalar 'float64' (scalarFloat64 = vector1[1])
Create array from list  -> newVec = array([0.2, 0.3], dtype="float64")
Assign PyFloat to all -> vector1[:] = pyFloat
Assign Float64 to all -> vector1[:] = scalarFloat64
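For reference, a minimal sketch of how timings like the ones below can be 
reproduced with the standard timeit module (the setup names mirror the 
definitions above; the iteration count matches the 100000 runs used here):

```python
import timeit

# Mirrors the variable definitions used by the tests above.
setup = """
import numpy as np
vector1 = np.array([2.0, 3.1])
pyFloat = 3.1
scalarFloat64 = vector1[1]
"""

# A few representative statements from the test list.
tests = {
    "Array * Array":          "vector1 * vector1",
    "PyFloat * Float64":      "pyFloat * scalarFloat64",
    "Float64 ** 2":           "scalarFloat64 ** 2",
    "Create array from list": "np.array([0.2, 0.3], dtype='float64')",
}

for name, stmt in tests.items():
    t = timeit.timeit(stmt, setup=setup, number=100000)
    print("%-25s %.4f s" % (name, t))
```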


I ran every test 100000 times and timed it in seconds.
These are the base timings with the original Numpy
     TIME[s]    TEST
1)    0.2003    Array * Array
2)    0.2502    PyFloat * Array
3)    0.2689    Float64 * Array
4)    0.2469    PyFloat + Array
5)    0.2640    Float64 + Array
6)    0.0055    PyFloat * PyFloat
7)    0.0278    Float64 * Float64
8)    0.0778    PyFloat * Float64
9)    0.0893    PyFloat * vector1[1]
10)    0.0767    PyFloat + Float64
11)    0.0532    PyFloat < Float64
12)    0.0543    if PyFloat < Float64 :
13)    0.6788    Create array from list
14)    0.0708    Assign PyFloat to all
15)    0.0775    Assign Float64 to all
16)    0.2994    Float64  * pyFloat * pyFloat * pyFloat * pyFloat
17)    0.1053    Float64 * Float64 * Float64 * Float64 * Float64
18)    0.0918    Float64 ** 2
19)    0.0156    pyFloat ** 2

- Test (13) is the operation that takes the longest overall
- PyFloat * Float64 is 14 times slower than PyFloat * PyFloat

By addressing bottleneck (a) I got the following time ratios 
(BaseTime/NewTime), i.e. RATIO > 1 means GOOD.
     RATIO    TEST
1)    1.1    Array * Array
2)    1.1    PyFloat * Array
3)    1.1    Float64 * Array
4)    1.1    PyFloat + Array
5)    1.2    Float64 + Array
6)    1.0    PyFloat * PyFloat
7)    1.7    Float64 * Float64
8)    2.8    PyFloat * Float64
9)    2.1    PyFloat * vector1[1]
10)    2.8    PyFloat + Float64
11)    3.3    PyFloat < Float64
12)    3.3    if PyFloat < Float64:
13)    3.2    Create array from list
14)    1.2    Assign PyFloat to all
15)    1.2    Assign Float64 to all
16)    2.9    Float64  * pyFloat * pyFloat * pyFloat * pyFloat
17)    1.7    Float64 * Float64 * Float64 * Float64 * Float64
18)    2.4    Float64 ** 2
19)    1.0    pyFloat ** 2

The speed-ups in Tests (13) and (16) resulted in a big speed boost in our code


Keeping the changes above, and addressing (b) in a way that did not 
change the data types of the return values, I got the following time 
ratios (BaseTime/NewTime), i.e. RATIO > 1 means GOOD.
     RATIO    TEST
1)    1.1    Array * Array
2)    1.1    PyFloat * Array
3)    1.2    Float64 * Array
4)    1.1    PyFloat + Array
5)    1.2    Float64 + Array
6)    1.0    PyFloat * PyFloat
7)    1.7    Float64 * Float64
8)    4.3    PyFloat * Float64
9)    3.1    PyFloat * vector1[1]
10)    4.4    PyFloat + Float64
11)    9.3    PyFloat < Float64
12)    9.2    if PyFloat < Float64 :
13)    3.2    Create array from list
14)    1.2    Assign PyFloat to all
15)    1.2    Assign Float64 to all
16)    4.7    Float64  * pyFloat * pyFloat * pyFloat * pyFloat
17)    1.8    Float64 * Float64 * Float64 * Float64 * Float64
18)    2.4    Float64 ** 2
19)    1.0    pyFloat ** 2

- Scalar operations are quite a bit faster, but PyFloat * Float64 is 
still 2.9 times slower than PyFloat * PyFloat


I then decided to tackle (b) even further by changing operations like
PyFloat * Float64
to return a PyFloat as opposed to a Float64.
This is the change that I don't think is going to be very popular.

This is what I got,
1)    1.1    Array * Array
2)    1.1    PyFloat * Array
3)    1.2    Float64 * Array
4)    1.1    PyFloat + Array
5)    1.2    Float64 + Array
6)    1.0    PyFloat * PyFloat
7)    3.2    Float64 * Float64
8)    8.1    PyFloat * Float64
9)    4.1    PyFloat * vector1[1]
10)    8.3    PyFloat + Float64
11)    9.4    PyFloat < Float64
12)    9.2    if PyFloat < Float64 :
13)    3.2    Create array from list
14)    1.2    Assign PyFloat to all
15)    1.2    Assign Float64 to all
16)    17.3    Float64  * pyFloat * pyFloat * pyFloat * pyFloat
17)    3.3    Float64 * Float64 * Float64 * Float64 * Float64
18)    2.4    Float64 ** 2
19)    1.0    pyFloat ** 2

- Test (16) shows how a single Float64 spoils the speed of trivial math. 
Now imagine the effect across hundreds of lines like that.
- Even Test (17), which uses only Float64, got faster.
- Test (18) Float64 ** 2 still returns a float64 in this run.
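The contamination effect behind Test (16) is easy to see from Python (a 
small sketch of stock Numpy behavior, not a change from this patch set):

```python
import numpy as np

pyFloat = 3.1
scalarFloat64 = np.array([2.0, 3.1])[1]

# One Float64 operand is enough: every subsequent multiply in the
# chain is a (slower) numpy-scalar operation, and the result
# stays float64 rather than reverting to a Python float.
result = scalarFloat64 * pyFloat * pyFloat * pyFloat * pyFloat
print(type(result))        # <class 'numpy.float64'>
```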


Regarding bottleneck (c), deciding the type for a ufunc:
I hacked together a version for testing purposes to check the potential 
speed-up (some dirty changes in generate_umath.py). This version avoided 
the overhead of the calls that find the matching ufunc. The speed-up 
ratio for something like Array * Array was only 1.6, which was not too 
exciting, so I walked away for now.

=========================================================
3)

These are the actual changes to the C code,
For bottleneck (a)

In general,
- avoid calls to PyObject_GetAttrString when I know the type is
List, None, Tuple, Float, Int, String or Unicode

- avoid calls to PyObject_GetBuffer when I know the type is
List, None or Tuple


a.1)
In arrayobject.h after the line
#include "npy_interrupt.h"

I added a couple of #defines

//Check for exact native types that for sure do not
//support array-related methods. Useful for faster checks when
//validating whether an object supports these methods.
#define ISEXACT_NATIVE_PYTYPE(op) \
    (PyList_CheckExact(op) || (Py_None == op) || \
     PyTuple_CheckExact(op) || PyFloat_CheckExact(op) || \
     PyInt_CheckExact(op) || PyString_CheckExact(op) || \
     PyUnicode_CheckExact(op))

//Check for exact native types that for sure do not
//support the buffer protocol. Useful for faster checks when
//validating whether an object supports the buffer protocol.
#define NEVERSUPPORTS_BUFFER_PROTOCOL(op) \
    (PyList_CheckExact(op) || (Py_None == op) || PyTuple_CheckExact(op))
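To show what the macro buys, here is a hypothetical Python-level analogue 
(the function names are invented for illustration; the real change is in 
C): the slow path pays for a raised-and-caught AttributeError on every 
probe of a missing attribute, while the fast path skips the probe 
entirely for exact builtin types that can never carry it.

```python
# Exact builtin types that can never define __array_priority__,
# mirroring ISEXACT_NATIVE_PYTYPE.
_EXACT_NATIVE_TYPES = (list, tuple, float, int, str, type(None))

def probe_priority(obj):
    # C equivalent: PyObject_GetAttrString(obj, "__array_priority__"),
    # which raises (and then must clear) an error when it is missing.
    try:
        return obj.__array_priority__
    except AttributeError:
        return None

def probe_priority_fast(obj):
    # C equivalent: if (ISEXACT_NATIVE_PYTYPE(op)) { ret = NULL; }
    # -- no attribute lookup, no exception machinery.
    if type(obj) in _EXACT_NATIVE_TYPES:
        return None
    return probe_priority(obj)

class FakeArray:
    __array_priority__ = 10.0

print(probe_priority_fast(3.1))          # None, without raising
print(probe_priority_fast(FakeArray()))  # 10.0
```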


a.2)
In common.c above the line
     if ((ip=PyObject_GetAttrString(op, "__array_interface__"))!=NULL) {

I added
     if (ISEXACT_NATIVE_PYTYPE(op)){
         ip = NULL;
     }
     else{

and close the } before the line #if !defined(NPY_PY3K)


In common.c above the line
     if (PyObject_HasAttrString(op, "__array__")) {

I added
     if (ISEXACT_NATIVE_PYTYPE(op)){
     }
     else{

and close the } before the line #if defined(NPY_PY3K)


In common.c above the line
     if (PyObject_GetBuffer(op, &buffer_view, PyBUF_FORMAT|PyBUF_STRIDES

I added
     if ( NEVERSUPPORTS_BUFFER_PROTOCOL(op) ){

     }
     else{

and close the } before the line #endif


a.3)
In ctors.c above the line
     if ((e = PyObject_GetAttrString(s, "__array_struct__")) != NULL) {

I added
     if (ISEXACT_NATIVE_PYTYPE(s)){
         e = NULL;
     }
     else{

and close the } before the line n = PySequence_Size(s);


In ctors.c above the line
     attr = PyObject_GetAttrString(input, "__array_struct__");

I added
     if (ISEXACT_NATIVE_PYTYPE(input)){
         attr = NULL;
         return Py_NotImplemented;
     }
     else{

and close the } before the line if (!NpyCapsule_Check(attr)) {


In ctors.c above the line
     inter = PyObject_GetAttrString(input, "__array_interface__");

I added
     if (ISEXACT_NATIVE_PYTYPE(input)){
         inter = NULL;
         return Py_NotImplemented;
     }
     else{

and close the } before the line if (!PyDict_Check(inter)) {


In ctors.c above the line
     array_meth = PyObject_GetAttrString(op, "__array__");

I added
     if (ISEXACT_NATIVE_PYTYPE(op)){
         array_meth = NULL;
         return Py_NotImplemented;
     }
     else{

and close the } before the line if (context == NULL) {


In ctors.c above the line
     if (PyObject_GetBuffer(s, &buffer_view, PyBUF_STRIDES) == 0 ||

I added
     if ( NEVERSUPPORTS_BUFFER_PROTOCOL(s) ){

     }
     else{

and close the } before the line #endif


a.4)

In multiarraymodule.c above the line
     ret = PyObject_GetAttrString(obj, "__array_priority__");

I added
     if (ISEXACT_NATIVE_PYTYPE(obj)){
         ret = NULL;
     }
     else{

and close the } before the line if (PyErr_Occurred()) {



For bottleneck (b)

b.1)
I noticed that PyFloat * Float64 resulted in an unnecessary on-the-fly 
conversion of the PyFloat into a Float64 just to extract its underlying 
C double value. This happened in the function _double_convert_to_ctype, 
which comes from the code-generation pattern _@name@_convert_to_ctype.

I ended up splitting _@name@_convert_to_ctype into two sections: one for 
double types and one for the rest of the types, where I extract the C 
value directly if the object passes a PyFloat_CheckExact check (this 
could be extended to other types).
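In Python terms, the split behaves roughly like this sketch 
(double_convert_to_ctype here is an illustrative stand-in for the 
generated _double_convert_to_ctype; the real code is C template source):

```python
import numpy as np

def double_convert_to_ctype(a):
    # Fast path added in b.1: mirrors PyFloat_CheckExact followed by
    # PyFloat_AS_DOUBLE -- no temporary float64 scalar is created.
    if type(a) is float:
        return a
    # Fallback: stands in for the original generic conversion, which
    # went through an on-the-fly array scalar.
    return float(np.float64(a))

print(double_convert_to_ctype(3.1))              # 3.1
print(double_convert_to_ctype(np.float64(2.5)))  # 2.5
```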

in scalarmathmodule.c.src I added,
/**begin repeat
  * #name = double#
  * #Name = Double#
  * #NAME = DOUBLE#
  * #PYCHECKEXACT = PyFloat_CheckExact#
  * #PYEXTRACTCTYPE = PyFloat_AS_DOUBLE#
  */

static int
_@name@_convert_to_ctype(PyObject *a, npy_@name@ *arg1)
{
     PyObject *temp;

     if (@PYCHECKEXACT@(a)){
         *arg1 = @PYEXTRACTCTYPE@(a);
         return 0;
     }

... The rest of this function is the implementation of the original
_@name@_convert_to_ctype(PyObject *a, npy_@name@ *arg1)

The original implementation of _@name@_convert_to_ctype no longer 
includes double, i.e.
/**begin repeat
  * #name = byte, ubyte, short, ushort, int, uint, long, ulong, longlong,
  *         ulonglong, half, float, longdouble, cfloat, cdouble, 
clongdouble#
  * #Name = Byte, UByte, Short, UShort, Int, UInt, Long, ULong, LongLong,
  *         ULongLong, Half, Float, LongDouble, CFloat, CDouble, 
CLongDouble#
  * #NAME = BYTE, UBYTE, SHORT, USHORT, INT, UINT, LONG, ULONG, LONGLONG,
  *         ULONGLONG, HALF, FLOAT, LONGDOUBLE, CFLOAT, CDOUBLE, 
CLONGDOUBLE#
  */

static int
_@name@_convert_to_ctype(PyObject *a, npy_@name@ *arg1)



b.2) This is the change that may not be very popular among Numpy users. 
I modified Float64 operations to return a Float instead of a Float64. I 
could not think of or observe any ill effects, and I got a fairly decent 
speed boost.
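For contrast, a minimal sketch of the visible difference. Run against 
stock Numpy this shows the current behavior; the plain-float result below 
is produced by explicit conversion purely to show the type the patched 
scalar math would hand back instead.

```python
import numpy as np

a = np.float64(2.0)
b = np.float64(3.0)

# Stock Numpy: Float64 * Float64 stays a numpy scalar.
print(type(a * b))                  # <class 'numpy.float64'>

# The b.2 change would return a plain Python float instead, i.e.
# the type this explicit round-trip produces:
print(type(float(a) * float(b)))    # <class 'float'>
```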

in scalarmathmodule.c.src I modified the repeat block to this:

/**begin repeat
  * #name=(byte,ubyte,short,ushort,int,uint,long,ulong,longlong,ulonglong)*13,
  *       (half, float, double, longdouble, cfloat, cdouble, clongdouble)*6,
  *       (half, float, double, longdouble)*2#
  * #Name=(Byte,UByte,Short,UShort,Int,UInt,Long,ULong,LongLong,ULongLong)*13,
  *       (Half, Float, Double, LongDouble, CFloat, CDouble, CLongDouble)*6,
  *       (Half, Float, Double, LongDouble)*2#
  * #oper=add*10, subtract*10, multiply*10, divide*10, remainder*10,
  *              divmod*10, floor_divide*10, lshift*10, rshift*10, and*10,
  *              or*10, xor*10, true_divide*10,
  *       add*7, subtract*7, multiply*7, divide*7, floor_divide*7, true_divide*7,
  *       divmod*4, remainder*4#
  * #fperr=1*70,0*50,1*10,
  *        1*42,
  *        1*8#
  * #twoout=0*50,1*10,0*70,
  *         0*42,
  *         1*4,0*4#
  * #otyp=(byte,ubyte,short,ushort,int,uint,long,ulong,longlong,ulonglong)*12,
  *              float*4, double*6,
  *       (half, float, double, longdouble, cfloat, cdouble, clongdouble)*6,
  *       (half, float, double, longdouble)*2#
  * #OName=(Byte,UByte,Short,UShort,Int,UInt,Long,ULong,LongLong,ULongLong)*12,
  *              Float*4, Double*6,
  *        (Half, Float, Double, LongDouble, CFloat, CDouble, CLongDouble)*6,
  *        (Half, Float, Double, LongDouble)*2#
  * #OutUseName=(Byte,UByte,Short,UShort,Int,UInt,Long,ULong,LongLong,ULongLong)*12,
  *              Float*4, out*6,
  *        (Half, Float, out, LongDouble, CFloat, CDouble, CLongDouble)*6,
  *        (Half, Float, out, LongDouble)*2#
  * #AsScalarArr=(1,1,1,1,1,1,1,1,1,1)*12,
  *              1*4, 0*6,
  *        (1, 1, 0, 1, 1, 1, 1)*6,
  *        (1, 1, 0, 1)*2#
  * #RetValCreate=(PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New)*12,
  *              PyArrayScalar_New*4, PyFloat_FromDouble*6,
  *        (PyArrayScalar_New, PyArrayScalar_New, PyFloat_FromDouble, PyArrayScalar_New, PyArrayScalar_New, PyArrayScalar_New, PyArrayScalar_New)*6,
  *        (PyArrayScalar_New, PyArrayScalar_New, PyFloat_FromDouble, PyArrayScalar_New)*2#
  */

#if !defined(CODEGEN_SKIP_@oper@_FLAG)

static PyObject *
@name@_@oper@(PyObject *a, PyObject *b)
{

... Same as before and ends with ...

#else
     ret = @RetValCreate@(@OutUseName@);
     if (ret == NULL) {
         return NULL;
     }
     if (@AsScalarArr@)
         PyArrayScalar_ASSIGN(ret, @OName@, out);
#endif
     return ret;
}
#endif

/**end repeat**/


I still need to handle the section for operations with two return values, 
as well as the power function. I am not sure what else could be there.


=========================================================

That's about it. Sorry for the long email; I tried to summarize as much 
as possible.


Let me know if you have any questions or if you want the actual files I 
modified.


Cheers,

Raul Cota





