[Numpy-discussion] No Copy Reduce Operations

Mon Jul 28 10:31:49 EDT 2008

Luis Pedro Coelho wrote:
> Hello all,
>
> Numpy arrays come with several reduce operations: sum(), std(), argmin(), 
> min(), ....
>
> The traditional implementation of these suffers from two big problems: It is 
> slow and it often allocates intermediate memory. I have code that is failing 
> with OOM (out of memory) exceptions in calls to ndarray.std(). I regularly 
> handle arrays with 100 million entries (have a couple of million objects * 20 
> features per object = 100 million doubles), so this is a real problem for me.
>
> This being open-source, I decided to solve the problem. My first idea was to 
> try to improve the numpy code. I failed to see how to do that while 
> supporting everything that numpy does (multiple types, for example), so I 
> started an implementation of reduce operations that uses C++ templates to 
> make code optimised into the types it actually uses, choosing the right 
> version to use at run time. In the spirit or release-early/release-often, I 
> attach the first version of this code that runs.
>
> BASIC IDEA
> ===============
>
> ndarray.std does basically the following (The examples are in pseudo-code even 
> though the implementation happens to be in C):
>
> def stddev(A):
> 	mu = A.mean()
> 	diff=(A-mu)
> 	maybe_conj=(diff if not complex(A) else diff.conjugate())
> 	diff *= maybe_conj
> 	return diff.sum()
>
> With a lot of temporary arrays. My code does:
>
> def stddev(A):
> 	mu = A.mean() # No getting around this temporary
> 	std = 0
> 	for i in xrange(A.size):
> 		diff = (A[i]-mu)
> 		if complex(A):
> 			diff *= conjugate(diff)
> 		else:
> 			diff *= diff
> 		std += diff
> 	return sqrt(diff/A.size)
>
> Of course, my code does it while taking into account the geometry of the 
> array. It handles arrays with arbitrary strides.
>
> I do it while avoiding copying the array at any time (while most of the 
> existing code will transpose/copy the array so that it is well behaved in 
> memory).
>
>
> IMPLEMENTATION
> ===============
>
> I have written some template infrastructure so that, if I wanted a very fast 
> entropy calculation, on a normalised array, you could do:
>
> template < ... >
> void compute_entropy(BaseIterator data, BaseIterator past, ResultsIterator 
> results) {
> 	while (data != past) {
> 		if (*data) *result += *data * std::log(*data);
> 		++data;
> 		++result;
> 	}
> }
>
> and the machinery will instantiate this in several variations, deciding at 
> run-time which one to call. You just have to write a C interface function 
> like
>
> PyObject* fast_entropy(PyArrayObject *self, PyObject *args, PyObject *kwds)
> {
>     int axis=NPY_MAXDIMS;
>     PyArray_Descr *dtype=NULL;
>     PyArrayObject *out=NULL;
>     static char *kwlist[] = {"array","axis", "dtype", "out", NULL};
>
>     if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|O&O&O&", kwlist,
>                                      &self,
>                                      PyArray_AxisConverter,&axis,
>                                      PyArray_DescrConverter2, &dtype,
>                                      PyArray_OutputConverter, &out)) {
>         Py_XDECREF(dtype);
>         return NULL;
>     }
>
>     int num = _get_type_num_double(self->descr, dtype);
>     Py_XDECREF(dtype);
>     return compress_dispatch<EntropyCompute>(self, out, axis, num, 
> EmptyType()); // This decides which function to call
> }
>
>
> For contiguous arrays, axis=None, this becomes
>
> void compute_entropy(Base* data, Base* past, Results* result) {
> 	while (data != past) {
> 		if (*data) *result += *data * std::log(*data);
> 		++data;
> 	}
> }
>
> which is probably as fast as it can be.
>
> If the array is not contiguous, this becomes
>
> void compute_entropy(numpy_iterator<Base> data, numpy_iterator<Base> past, 
> Results* result) {
> 	while (data != past) {
> 		if (*data) *result += *data * std::log(*data);
> 		++data;
> 	}
> }
>
> where numpy_iterator is a type that knows how to iterate over numpy arrays 
> following strides.
>
> If axis is not None, then the result will not be a single value, it will be an 
> array. The machinery will automatically create the array of the right size 
> and pass it to you with so that the following gets instantiated:
>
> void compute_entropy(numpy_iterator<Base> data, numpy_iterator<Base> past, 
> numpy_iterator<ResultType> results) {
> 	while (data != past) {
> 		if (*data) *results += *data * std::log(*data);
> 		++data;
> 		++results;
> 	}
> }
>
> The results parameter has the strides set up correctly to iterate over 
> results, including looping back when necessary so that the code works as it 
> should.
>
> Notice that the ++results operation seems to be dropping in and out. In fact, 
> I was oversimplifying above. There is no instantiation with Result*, but with 
> no_iteration<Result> which behaves like Result*, but with an empty operator 
> ++(). You never change your template implementation.
>
> (The code above was for explanatory purposes, not an example of working code. 
> The interface I actually used takes more parameters which are not very 
> important for expository purposes. This allows you to, for example, implement 
> the ddof parameter in std()).
>
> ADVANTAGES
> ===========
>
> For most operations, my code is faster (it's hard to beat ndarray.sum(), but 
> easy to beat ndarray.std()) than numpy on both an intel 32 bit machine and an 
> amd 64 bit machine both newer than one year (you can test it on your machine 
> by runnning profile.py). For some specific operations, like summing along a 
> non-last axis on a moderately large array, it is slower (I think that the 
> copy&process tactic might be better in this case than the no-copy/one pass 
> operation I am using). In most cases I tested, it is faster. In particular, 
> the basic case (a well-behaved array), it is faster.
>
> More important than speed (at least for me) is the fact that this does not 
> allocate more memory than needed. This will not fail with OOM errors.
>
> It's easy to implement specific optimisations. For example, replace a sum 
> function for a specific case to call AMD's framewave SIMD library (which uses 
> SIMD instructions):
>
> void compute_sum(short* data, short* past, no_iteration<short> result) {
> 	fwiSum_16s_C1R  ( data, sizeof(short), past-start, &*result);
> }
>
> or, compute the standard deviation of an array of boolean with a single pass 
> (sigma = sqrt(p(1-p))):
>
> void compute_std(bool* data, bool* past, no_iteration<ResType> result) {
>     size_type N = (past-data);
>     size_type pos = 0;
>     while (data != past) {
>         if (*data) ++pos;
>         ++data;
>     }
>     *result = std::sqrt(ResType(pos)/ResType(N)*(1-ResType(pos)/ResType(N)));
> }
>
> NOT IMPLEMENTED (YET)
> ======================
>
> Non-aligned arrays are not implemented.
> out arrays have to be well behaved.
>
> My current idea is to compromise and make copies in this case. I could also, 
> trivially, write a thing that handled those cases without copying, but I 
> don't think it's worth the cost in code bloat.
>
> * argmax()/argmin(). This is a bit harder to implement than the rest, as it 
> needs a bit more machinery. I think it's possible, but I haven't gotten 
> around to it.
>
> * ptp(). I hesitate whether to simply do it in Python:
>
> def ncr_ptp(A,axis=None,dtype=None,out=None):
>     res=ncr.max(A,axis=axis,dtype=dtype,out=out)
>     min=ncr.min(A,axis=axis,dtype=dtype)
>     res -= max
>     return res
>
> Does two passes over the data, but no extra copying. I could do the same using 
> the Python array API, of course.
>
> DISADVANTAGES
> ==============
>
> It's C++. I don't see it as a technical disadvantage, but I understand some 
> people might not like it. If this was used in the numpy code base, it could 
> replace the current macro language (begin repeat // end repeat), so the total 
> number of languages used would not increase.
>
> It generates too many functions. Disk is cheap, but do we really need a well 
> optimised version of std() for the case of complex inputs and boolean output? 
> What's a boolean standard deviation anyway?.  Maybe some tradeoff is 
> necessary: optimise the defaults and make others possible.
>
> BUGS
> =====
>
> I am not correctly implementing the dtype parameter. I thought it controlled 
> the type of the intermediate results, but if I do
>
> import numpy
> A=numpy.arange(10)
> A.mean(dtype=numpy.int8)
>
> I get
>
> 4.5
>
> which surprises me!
>
> Furthermore,
>
> A.mean(dtype=numpy.int8).dtype
>
> returns
>
> dtype('float64')
>
> In my implementation, mean(numpy.arange(10),dtype=numpy.int8) is 4
>
> (This might be considered a numpy bug --- BTW, i am not running subversion for 
> this comparison).
>
> * I also return only either python longs or python floats. This is a trivial 
> change, I just don't know all the right function names to return numpy types.
>
> * A couple of places in the code still have a FIXME on them
>
> FUTURE
> =======
>
> I consider this code proof-of-concept. It's cool, it demonstrate that the idea 
> works, but it is rough and needs cleaning up. There might even be bugs in it 
> (especially the C interface with Python is something I am not so familiar 
> with)!
>
> I see three possible paths for this:
> 	(1) You agree that this is nice, it achieves good results and I should strive 
> to integrate this into numpy proper, replacing the current implementation. I 
> would start a branch in numpy svn and implement it there and finally merge 
> into the main branch. I would be perfectly happy to relicense to BSD if this 
> is the case.
> 	One could think of the array-to-array (A+B, A+=2, A != B,...) operations 
> using a similar scheme
> 	This would imply that part of the C API would be obsoleted: the whole 
> functions array would stop making sense. I don't know how widespread it's 
> used.
>
> 	(2) I keep maintaining this as a separate package and whoever wants to use 
> it, can use it. I will probably keep it GPL, for now.
> 	Of course, at some later point in time, one could integrate this into the 
> main branch (Again, in this case, I am happy to relicense).
>
> 	(3) Someone decides to take this as a challenge and re-implements ndarray.std 
> and the like so that it uses less memory and is faster, but does it still in 
> C. I am not much of a C programmer, so I don't see how this could be done 
> without really ugly reliance on the preprocessor, but maybe it could be done.
>
> What do you think?
>
> bye,
> Luís Pedro Coelho
> PhD Student in Computational Biology
>   
> ------------------------------------------------------------------------
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
FYI for computing variance (and hence std) you probably should be using 
Knuth's (or Welford's) one pass approach (On-line algorithm) to avoid 
recomputing the mean:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

Also for large arrays, you may also want to maximize the precision to 
avoid potential overflow.

Bruce