[Numpy-discussion] Fortran order arrays to and from numpy arrays

Fri Feb 23 16:19:05 EST 2007

Hi,

I'm currently puzzling over how to best convert (column major order) matlab
arrays to numpy arrays and vice versa -- I'm looking for a solution that's
simple, general and reasonably fast -- being also applicable to Numeric arrays
would be a plus (I'd like to retain Numeric compatibility for the current
version of the code).

The solution that mlabwrap (http://mlabwrap.sf.net) currently employs to
transfer arrays between matlab and python is to have a low level C++ extension
(mlabraw) copying the data into a new numpy array created with
Pyarray_SimpleNew and using a simple or nested for-loop to copy the data
around (only rank up to 2 is currently handled), e.g.

    //matlab matrix -> numpy
    for (mwIndex lCol = 0; lCol < lCols; lCol++) {
      double *lDst = (double *)(lRetval->data) + lCol;
      for (mwIndex lRow = 0; lRow < lRows; lRow++, lDst += lCols) {
        *lDst = *lPR++;}}

    //numpy 2d array -> matlab
    npy_intp lRowDelta = pStrides[0]/sizeof(T);
    npy_intp lColDelta = pStrides[1]/sizeof(T);
    for (npy_intp lCol=0; lCol < pCols; lCol++) {
      T *lSrc = pSrc + lCol*lColDelta;
      for (npy_intp lRow=0; lRow < pRows; lRow++, lSrc += lRowDelta) {
        *pDst++ = *lSrc;}}

Some simple timing results suggest 2 things:

1. There is an enormous overhead associated with calling into matlab (e.g.
   just using mlabraw to evaluate ``sin(1);`` in matlab, without accessing the
   result the result is about 1000 times slower than ``math.sin(1)``. I have
   no idea where this rather huge overhead comes from, but I suspect there
   isn't much one can do about it.

2. Despite this overhead, copying around large arrays (e.g. >=1e5 elements) in
   above way causes notable additional overhead. Whilst I don't think there's
   a sane way to avoid copying by sharing data between numpy and matlab the
   copying could likely be done better.

I guess the moral from 1 might be that ctypes will be the way to go for
future version, I would assume that the overhead associated with ctypes
doesn't matter that much and that it will be much less painful to maintain
than having a C or C++ extension module, especially given that the Mathworks
seem fond of changing the C API every minor release. I haven't used ctypes yet
so I'd be interested to know what people think.

The moral from 2. might be just to memcpy the data wholesale where possible
and transpose and reshape afterwards or, in numpy, to set the fortran-order
flag (and query fortran-contiguousness on copying to matlab) -- I suppose
that specifying fortran order and tranposing/reshaping are pretty much
equivalent? So matlab to numpy would look roughly so:

  a_fortran_order = memcpy_raw_matlab_data_into_numpy_array(ptr)
  a = a_fortran_order.T.reshape(a_fortran_order.shape)

I would assume that this ought be on average more efficient than always
transforming to C-contiguous (as some of the time the returned array will
anyway be transposed afterwards or accessed in a fashion were column major is
faster; the case that several operations that profit from row-major storage
are carried out on the same array seems unlikely).

Unfortunately I don't see an easy way to use the same approach the other way
(matlab doesn't seem to offer much on the C level to manipulate arrays), so
I'd presumably need something like:

  stuff_into_matlab_array(a.T.reshape(a.shape).copy())

the question is how to avoid doing two copies.

Any comments appreciated,

'as