[Numpy-discussion] Advice please on efficient subtotal function

Fri Dec 29 04:05:23 EST 2006

Hi,

I'm looking for efficient ways to subtotal a 1-D array onto a 2-D grid. This
is more easily explained in code than in words, thus:

for n in xrange(len(data)):
    totals[ i[n], j[n] ] += data[n]

data comes from a series of PyTables files with ~200 million rows. Each row
has ~20 columns, and I use the first three columns (which are 1-3 character
strings) to form the indexing functions i[] and j[], then want to calculate
averages of the remaining 17 numerical columns.
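
To make the indexing step concrete: one way to turn a short-string column
into integer indices is np.unique with return_inverse=True. This is only a
minimal sketch; the sample keys are made up, and using a single column for
i[] is an assumption for illustration, not the actual segmentation.

```python
import numpy as np

# Hypothetical key column of 1-3 character strings (made-up sample values).
keys = np.array(["ab", "c", "ab", "xyz", "c"])

# labels holds the sorted distinct keys; codes gives, for every row, the
# position of its key in labels, so codes can serve directly as i[].
labels, codes = np.unique(keys, return_inverse=True)

print(labels)  # distinct keys, sorted
print(codes)   # one integer index per row
```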

I have tried various indirect ways of doing this with searchsorted and
bincount, but intuitively they feel like overly complex solutions to what is
essentially a very simple problem.
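
For reference, a bincount-based version looks roughly like the sketch below.
It assumes i[] and j[] are already non-negative integer arrays and that the
grid shape is known in advance; the function name subtotal_bincount and the
sample arrays are hypothetical.

```python
import numpy as np

def subtotal_bincount(i, j, data, shape):
    """Sum data[n] into totals[i[n], j[n]] without a Python-level loop."""
    n_rows, n_cols = shape
    # Collapse each (i, j) pair into a single flat cell index.
    flat = np.asarray(i) * n_cols + np.asarray(j)
    # bincount adds up the weights that share a flat index; minlength
    # makes sure empty trailing cells still appear in the result.
    sums = np.bincount(flat, weights=data, minlength=n_rows * n_cols)
    return sums.reshape(n_rows, n_cols)

# Made-up sample data: five values landing in three distinct cells.
i = np.array([0, 1, 0, 2, 1])
j = np.array([1, 0, 1, 2, 0])
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

totals = subtotal_bincount(i, j, data, shape=(3, 3))
print(totals)
```

Repeated (i, j) pairs are accumulated correctly here, which is not the case
for the fancy-indexed form totals[i, j] += data.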

My work involves comparing the subtotals for different segmentation
strategies (the i[] and j[] indexing functions). Efficient solutions are
important because I need to make many passes through the 200 million rows of
data. Memory usage is the easiest thing for me to adjust, by changing how many
rows of data to read in for each pass and then reusing the same array data
buffers.
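
The chunked pattern I have in mind is sketched below for a single numerical
column. The function name accumulate_in_chunks and the in-memory list of
chunks are hypothetical stand-ins for reading blocks of rows from the
PyTables files; i[] and j[] are assumed to be integer arrays already.

```python
import numpy as np

def accumulate_in_chunks(chunks, shape):
    """Average data per (i, j) cell over an iterable of (i, j, data) chunks."""
    n_rows, n_cols = shape
    size = n_rows * n_cols
    sums = np.zeros(size)
    counts = np.zeros(size, dtype=np.int64)
    for i, j, data in chunks:
        flat = i * n_cols + j  # flat cell index for every row in the chunk
        sums += np.bincount(flat, weights=data, minlength=size)
        counts += np.bincount(flat, minlength=size)
    # Cells that never received a row come out as NaN rather than raising.
    with np.errstate(invalid="ignore", divide="ignore"):
        means = sums / counts
    return means.reshape(shape)

# Two made-up chunks standing in for successive reads from disk.
chunks = [
    (np.array([0, 1]), np.array([1, 0]), np.array([1.0, 2.0])),
    (np.array([0, 1]), np.array([1, 1]), np.array([3.0, 4.0])),
]
means = accumulate_in_chunks(chunks, shape=(2, 2))
print(means)
```

Only the running sums and counts persist between chunks, so peak memory is
set by the chunk size rather than by the total number of rows.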

Thanks in advance for any suggestions!

Stephen