[Numpy-discussion] [Newbie] Fast plotting

Tue Jan 6 09:44:42 EST 2009

Francesc Alted wrote:
> On Tuesday 06 January 2009, Franck Pommereau wrote:
>   
>> Hi all, and happy new year!
>>
>> I'm new to NumPy and searching for a way to compute, from a set of
>> points (x,y), the mean of the y values associated with each distinct
>> x value. Each point corresponds to a measurement in a benchmark
>> (x = parameter, y
>> = computation time) and I'd like to plot the graph of mean
>> computation time wrt parameter values. (I know how to plot, but not
>> how to compute mean values.)
>>
>> My points are stored as two arrays X, Y (same size).
>> In pure Python, I'd do as follows:
>>
>> s = {} # sum of y values for each distinct x (as keys)
>> n = {} # number of summed values (same keys)
>> for x, y in zip(X, Y) :
>>     s[x] = s.get(x, 0.0) + y
>>     n[x] = n.get(x, 0) + 1
>> new_x = array(list(sorted(s)))
>> new_y = array([s[x]/n[x] for x in sorted(s)])
>>
>> Unfortunately, this code is much too slow because my arrays have
>> millions of elements. But I'm pretty sure that NumPy offers a way to
>> handle this more elegantly and much faster.
>>
>> As a bonus, I'd be happy if the solution also allowed me to compute
>> the standard deviation, min, max, etc.
>>     
>
> The following would do the trick:
>
> In [92]: x = np.random.randint(100,size=100)
>
> In [93]: y = np.random.rand(100)
>
> In [94]: u = np.unique(x)
>
> In [95]: means = [ y[x == i].mean() for i in u ]
>
> In [96]: stds = [ y[x == i].std() for i in u ]
>
> In [97]: maxs = [ y[x == i].max() for i in u ]
>
> In [98]: mins = [ y[x == i].min() for i in u ]
>
> and the data you want will be in the means, stds, maxs and mins lists.
> This approach has the drawback that you have to process the array each
> time you want to extract the desired info.  If you always want to
> retrieve the same set of statistics, you can do this in a single
> loop:
>
> In [99]: means, stds, maxs, mins = [], [], [], []
>
> In [100]: for i in u:
>     g = y[x == i]
>     means.append(g.mean())
>     stds.append(g.std())
>     maxs.append(g.max())
>     mins.append(g.min())
>    .....:
>
> which has the same effect as above, but is faster because each mask
> x == i is computed once instead of four times.
>
> Hope that helps,
>
>   
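The quoted loop still builds one boolean mask over the whole array for
each distinct x value, so it may remain slow when there are many distinct
values. A possible alternative, shown here only as a rough sketch, is to
sort once and reduce over runs of equal x with ufunc.reduceat. The
function name grouped_stats is made up for illustration, and the sketch
assumes X and Y are non-empty 1-D arrays of equal length:

```python
import numpy as np

def grouped_stats(x, y):
    """Per-distinct-x mean, std, min and max of y (hypothetical helper)."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    # Sort once so that equal x values form contiguous runs.
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    # Index at which each run of equal x values starts.
    starts = np.flatnonzero(np.r_[True, xs[1:] != xs[:-1]])
    counts = np.diff(np.r_[starts, xs.size])
    means = np.add.reduceat(ys, starts) / counts
    # Population standard deviation (ddof=0), like ndarray.std() by default.
    dev = ys - np.repeat(means, counts)
    stds = np.sqrt(np.add.reduceat(dev * dev, starts) / counts)
    mins = np.minimum.reduceat(ys, starts)
    maxs = np.maximum.reduceat(ys, starts)
    return xs[starts], means, stds, mins, maxs
```

Calling new_x, means, stds, mins, maxs = grouped_stats(X, Y) should give
the same numbers as the quoted loop, with new_x playing the role of u.
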
If you use Knuth's one-pass approach 
(http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#III._On-line_algorithm) 
you can write a function that gets the min, max, mean and 
variance/standard deviation in a single pass through the array rather 
than one pass for each. I do not know whether this will provide any 
advantage, as that will probably depend on the size of the arrays.
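
To make the one-pass idea concrete, here is a minimal sketch of the
online update (usually credited to Welford and presented by Knuth). The
name running_stats is hypothetical, and it returns the population
variance (divide by n). A pure Python loop like this is unlikely to be
fast on millions of elements; it is only meant to show the update step:

```python
import math

def running_stats(values):
    """Return (min, max, mean, variance) in one pass over `values`."""
    n = 0
    mean = 0.0
    m2 = 0.0          # running sum of squared deviations from the mean
    lo = math.inf
    hi = -math.inf
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        # Uses the old and the updated mean, which keeps the sum stable.
        m2 += delta * (v - mean)
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    if n == 0:
        raise ValueError("need at least one value")
    return lo, hi, mean, m2 / n
```

The standard deviation is math.sqrt of the returned variance; to get
per-x statistics, this would be applied to each group y[x == i].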

Also, please use the highest precision available (e.g. float128, where 
your platform provides it) for your arrays to minimize the numerical 
error that accumulates over arrays of this size.
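
As a small sketch of one way to do this without converting the stored
arrays: the mean() and std() methods accept a dtype argument that sets
the precision used for the computation. np.longdouble is the portable
name; whether it is really 128-bit (or wider than float64 at all)
depends on the platform. The function name mean_std_extended is made up
for illustration:

```python
import numpy as np

def mean_std_extended(y):
    """Mean and std of y, computed in extended precision where available."""
    y = np.asarray(y)
    # dtype here controls the accumulator, not the storage of y itself.
    mean = y.mean(dtype=np.longdouble)
    std = y.std(dtype=np.longdouble)
    return mean, std
```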

Bruce