[SciPy-user] Dealing with Large Data Sets

Anne Archibald peridot.faceted at gmail.com
Sat May 10 15:55:22 EDT 2008


2008/5/10 lechtlr <lechtlr at yahoo.com>:
> I try to create an array called 'results' as provided in an example below.
> Is there a way to do this operation more efficiently when the number of
> 'data_x' array gets larger ?  Also, I am looking for pointers to eliminate
> intermediate 'data_x' arrays, while creating 'results' in the following
> procedure.

The rule of thumb is, if you want to do the same thing to many
elements, just create an array of input values, then write the
calculation as if you had a single input value. Most numpy functions
act elementwise.
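
As a small illustration of that rule (the values and variable names
below are made up for this example, not taken from the original post),
the same expression can be applied to a single number or to a whole
array without writing a loop:

```python
import numpy as np

# The calculation written for a single input value.
x = 3.0
y = np.sqrt(x**2 + 1.0)

# The same expression applied to an array of inputs; it acts elementwise.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.sqrt(xs**2 + 1.0)

# Explicit loop, shown only for comparison with the vectorized form.
loop_ys = np.array([np.sqrt(v**2 + 1.0) for v in xs])
```

Here ys and loop_ys should hold the same values; the vectorized form
just lets numpy do the looping internally.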

> from numpy import *
> from numpy.random import *
>
> # what is the best way to create an array named 'results' below
> # when number of 'data_x' (i.e., x = 1, 2.....1000) is large.
> # Also nrows and ncolumns can go upto 10000
>
> nrows = 5
> ncolumns = 10
>
> data_1 = zeros([nrows, ncolumns], 'd')
> data_2 = zeros([nrows, ncolumns], 'd')
> data_3 = zeros([nrows, ncolumns], 'd')
>
> # to store squared sum of each column from the arrays above
> results = zeros([3,ncolumns], 'd')
>
> # loop to store raw data from a numerical operation;
> # rand() is given as an example here
> for i in range(nrows):
>     for j in range(ncolumns):
>         data_1[i,j] = rand()
>         data_2[i,j] = rand()
>         data_3[i,j] = rand()
>
> # store squared sum of each column from data_x
> for k in range(ncolumns):
>     results[0,k] = dot(data_1[:,k], data_1[:,k])
>     results[1,k] = dot(data_2[:,k], data_2[:,k])
>     results[2,k] = dot(data_3[:,k], data_3[:,k])
>
> print results

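With all the data_x arrays stacked into a single 3-d array, this becomes:

import numpy as np

ndata = 3  # number of data_x arrays; nrows and ncolumns as in your example
data = np.random.rand(ndata, nrows, ncolumns)
results = (data**2).sum(axis=1)  # sum over rows; shape (ndata, ncolumns)

or even

results = (np.random.rand(ndata, nrows, ncolumns)**2).sum(axis=1)

That last operation, which I have written as (data**2).sum(axis=1), is
kind of an embarrassment: it builds a temporary array as large as data
itself. dot() or its cousin tensordot() would be more efficient, but
they don't have a suitable "elementwise" implementation. Nevertheless,
squaring and then summing gives the right answer.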
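
For what it's worth, NumPy releases newer than this post provide
np.einsum, which can express the per-column sum of squares without
building the squared temporary. The following is only a sketch: ndata = 3
and the small nrows/ncolumns are illustrative values matching the
question, and whether einsum is actually faster depends on the array
sizes and the NumPy build, so it is worth timing before relying on it.

```python
import numpy as np

ndata, nrows, ncolumns = 3, 5, 10   # illustrative sizes from the question
data = np.random.rand(ndata, nrows, ncolumns)

# Reference: square every element, then sum over the row axis.
results = (data**2).sum(axis=1)

# einsum: multiply data by itself elementwise and sum over the row index j.
results_einsum = np.einsum('ijk,ijk->ik', data, data)

# Cross-check against the column-by-column dot() loop from the question.
results_loop = np.zeros((ndata, ncolumns))
for x in range(ndata):
    for k in range(ncolumns):
        results_loop[x, k] = np.dot(data[x, :, k], data[x, :, k])
```

All three arrays have shape (ndata, ncolumns) and should agree to within
floating-point rounding, which np.allclose() can confirm.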

Anne


