[Numpy-discussion] Advice please on efficient subtotal function

Stephen Simmons mail at stevesimmons.com
Fri Dec 29 19:22:16 EST 2006


Thanks Francesc, but I am already planning to read the data in block-wise as
you suggest. My question is rather how best to update the subtotals for each
block in a parallel way using numpy efficiently, rather than a simplistic and
slow element-by-element loop. 

I can't use a simple sum(), as in your benchmark example or Greg's reply,
because I need to do:
    for n in xrange(len(data)):
        totals[ i[n], j[n] ] += data[n]
and not 
    for n in xrange(len(data)):
        totals[n] += data[n]
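(The closest vectorised equivalent I've found is a bincount trick along the
lines of this rough, untested sketch, which assumes i and j are small
non-negative integer codes; it works but feels indirect, hence the block-wise
approach below.)

    import numpy

    def add_block_subtotals_bincount(totals, data, i, j):
        # Flatten each (i, j) pair into a single index into totals and let
        # bincount do the accumulation via its weights argument.
        flat_idx = i * totals.shape[1] + j
        counts = numpy.bincount(flat_idx, weights=data)
        # bincount only returns entries up to flat_idx.max(), so add the
        # result into the leading part of the flattened totals grid.
        totals.flat[:len(counts)] += counts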
 
My best solution so far is roughly like this:
 - read in next block of 100k or so rows (taking into account the PyTables
table's _v_maxTuples and _v_chunksize)
 - calculate the subtotal index arrays i and j
 - do a lexsort() on [i, j, n]
 - partition the sorted [i, j, n] into subsets where the i and j arrays change
values. The k_th such subset is thus 
    s_k = [ i_k, j_k, [n_k0, ..., n_kN] ] 
 - update the subtotals for each subset in the block
    totals[i_k, j_k]+= sum(data[n_k0, ..., n_kN])

This should be reasonably efficient, but it's messy, and I'm not familiar
enough with numpy's indexing tricks to get this right first time.
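Something along these lines is what I'm groping towards (a rough, untested
sketch, assuming i and j have already been converted to integer index arrays
for the current block):

    import numpy

    def add_block_subtotals(totals, data, i, j):
        # Sort the block so rows belonging to the same (i, j) cell are
        # adjacent; lexsort treats the last key as the primary sort key.
        order = numpy.lexsort((j, i))
        i_s, j_s, d_s = i[order], j[order], data[order]

        # Positions where the (i, j) pair changes mark the start of each
        # subset s_k.
        change = numpy.ones(len(d_s), dtype=bool)
        change[1:] = (i_s[1:] != i_s[:-1]) | (j_s[1:] != j_s[:-1])
        starts = numpy.nonzero(change)[0]

        # add.reduceat sums each run d_s[starts[k]:starts[k+1]] in one call.
        group_sums = numpy.add.reduceat(d_s, starts)

        # Each (i, j) pair occurs only once among the run starts, so the
        # fancy-indexed += accumulates correctly.
        totals[i_s[starts], j_s[starts]] += group_sums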

Maybe instead I'll have a go at writing a Pyrex function that implements the
simple loop at C speed:
   subtotal2d(data_array, idx_array, out=None, dtype=None)
where data_array is Nx1, idx_array is NxM and out is M-dimensional.
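For reference, the semantics I'm after could be sketched in pure Python like
this (the Pyrex version would essentially be the same loop with C-typed
indices):

    import numpy

    def subtotal2d(data_array, idx_array, out=None, dtype=None):
        # data_array: length-N 1-d array of values to accumulate
        # idx_array:  (N, M) integer array; row n gives the M-dimensional
        #             index of data_array[n] in the output grid
        # out:        M-dimensional accumulator, created from the index
        #             ranges if not supplied
        data_array = numpy.asarray(data_array).ravel()
        idx_array = numpy.asarray(idx_array)
        if out is None:
            shape = tuple(idx_array.max(axis=0) + 1)
            out = numpy.zeros(shape, dtype or data_array.dtype)
        for n in xrange(len(data_array)):
            out[tuple(idx_array[n])] += data_array[n]
        return out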


Incidentally, there's one other function I'd find useful here in forming the
index arrays i[] and j[]: a fast translate-from-dict function, i.e.
  arr2 = fromiter((d[a] for a in arr1), dtype=int)
My initial impression is that a C version would be substantially faster, but
maybe I should do some benchmarking to see whether the pure Python/numpy
approach is faster than I expect.
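Something along these lines is the benchmark I have in mind (made-up codes
standing in for the real 1-3 char column values, untested):

    import numpy
    from time import time

    # Hypothetical test data: 1m short string codes plus a dict mapping
    # each code to a small integer index.
    keys = ['A', 'BB', 'CCC', 'DD', 'E']
    d = dict((k, n) for n, k in enumerate(keys))
    arr1 = numpy.array(keys * 200000)

    t1 = time()
    # fromiter needs an explicit dtype; passing count as well lets it
    # preallocate the output array.
    arr2 = numpy.fromiter((d[a] for a in arr1), dtype=numpy.int32,
                          count=len(arr1))
    print "fromiter + dict lookup -->", round(time() - t1, 3)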

Cheers, and thanks for any further suggestions,

Stephen


Francesc Altet <faltet at carabos.com> wrote:

> On Friday, 29 December 2006 at 10:05, Stephen Simmons wrote:
> > Hi,
> >
> > I'm looking for efficient ways to subtotal a 1-D array onto a 2-D grid.
> > This is more easily explained in code than words, thus:
> >
> > for n in xrange(len(data)):
> >     totals[ i[n], j[n] ] += data[n]
> >
> > data comes from a series of PyTables files with ~200m rows. Each row has
> > ~20 cols, and I use the first three columns (which are 1-3 char strings)
> > to form the indexing functions i[] and j[], then want to calc averages of
> > the remaining 17 numerical cols.
> >
> > I have tried various indirect ways of doing this with searchsorted and
> > bincount, but intuitively they feel like overly complex solutions to what
> > is essentially a very simple problem.
> >
> > My work involves comparing the subtotals for various different
> > segmentation strategies (the i[] and j[] indexing functions). Efficient
> > solutions are important because I need to make many passes through the
> > 200m rows of data.
> > Memory usage is the easiest thing for me to adjust by changing how many
> > rows of data to read in for each pass and then reusing the same array data
> > buffers.
> 
> Well, from your words I guess you have already tested this, but just in
> case: as PyTables saves data in tables row-wise, it is always faster to use
> the complete row for computations in each iteration than just a single
> column. This is shown in the small benchmark that I'm attaching at the end
> of the message. Here is its output for a table with 1m rows:
> 
> time for creating the file--> 12.044
> time for using column reads --> 46.407
> time for using the row wise iterator--> 73.036
> time for using block reads (row wise)--> 5.156
> 
> So, using block reads (in case you can use them) is your best bet.
> 
> HTH,
> 
>
--------------------------------------------------------------------------------------
> import tables
> import numpy
> from time import time
> 
> nrows = 1000*1000
> 
> # Create a table definition with 17 double cols and 3 string cols
> coltypes = numpy.dtype("f8,"*17 + "S3,"*3)
> 
> t1 = time()
> # Create a file with an empty table. Use compression to minimize file size.
> f = tables.openFile("/tmp/prova.h5", 'w')
> table = f.createTable(f.root, 'table', numpy.empty(0, coltypes),
>                       filters=tables.Filters(complevel=1, complib='lzo'))
> # Fill the table with default values (empty strings and zeros)
> row = table.row
> for nrow in xrange(nrows):
>     row.append()
> f.close()
> print "time for creating the file-->", round(time()-t1, 3)
> 
> # *********** Start benchmarks **************************
> f = tables.openFile("/tmp/prova.h5", 'r')
> table = f.root.table
> colnames = table.colnames[:-3]  # exclude the string cols
> 
> # Loop over the table using column reads
> t1 = time(); cum = numpy.zeros(17)
> for ncol, colname in enumerate(colnames):
>     col = table.read(0, nrows, field=colname)
>     cum[ncol] += col.sum()
> print "time for using column reads -->", round(time()-t1, 3)
> 
> # Loop over the table using its row iterator
> t1 = time(); cum = numpy.zeros(17)
> for row in table:
>     for ncol, colname in enumerate(colnames):
>         cum[ncol] += row[colname]
> print "time for using the row iterator-->", round(time()-t1, 3)
> 
> # Loop over the table using block reads (row wise)
> t1 = time(); cum = numpy.zeros(17)
> step = 10000
> for nrow in xrange(0, nrows, step):
>     ra = table[nrow:nrow+step]
>     for ncol, colname in enumerate(colnames):
>         cum[ncol] += ra[colname].sum()
> print "time for using block reads (row wise)-->", round(time()-t1, 3)
> 
> f.close()
