[Numpy-discussion] Advice please on efficient subtotal function

Sat Dec 30 09:45:48 EST 2006

Hi Stephen,
>
> I'm looking for efficient ways to subtotal a 1-d array onto a 2-D
> grid. This
> is more easily explained in code that words, thus:
>
> for n in xrange(len(data)):
>     totals[ i[n], j[n] ] += data[n]
>
> data comes from a series of PyTables files with ~200m rows. Each row
> has ~20
> cols, and I use the first three columns (which are 1-3 char strings)
> to form
> the indexing functions i[] and j[], then want to calc averages of the
> remaining 17 numerical cols. 
>
> I have tried various indirect ways of doing this with searchsorted and
> bincount, but intuitively they feel overly complex solutions to what
> is
> essentially a very simple problem.
>
> My work involved comparing the subtotals for various different
> segmentation
> strategies (the i[] and j[] indexing functions). Efficient solutions
> are
> important because I need to make many passes through the 200m rows of
> data.
> Memory usage is the easiest thing for me to adjust by changing how
> many rows
> of data to read in for each pass and then reusing the same array data
> buffers.
>

It looks as if the values in your i and j columns come from a limited 
range, so you may consider encoding pairs of (i,j) values into one int 
using a suitable encoding function (e.g. ij = i+K*j if both i and j are
non-negative and K=max(i)+1). You could then use bincount(ij, data) to
get the sums per encoded (i,j) pair. This should be efficient, and the
complexity is only in the encoding/decoding steps.

Best regards,
Andreas
----
Dr. Andreas Eisele                     Senior Researcher
DFKI GmbH, Language Technology Lab        eisele at dfki.de
Stuhlsatzenhausweg 3               tel: +49-681-302-5285
D-66123 Saarbrücken                fax: +49-681-302-5338