[Numpy-discussion] Histograms of extremely large data sets

Wed Dec 13 23:40:03 EST 2006

On Dec 12, 2006, at 10:27 PM, Cameron Walsh wrote:

> I'm trying to generate histograms of extremely large datasets.  I've
> tried a few methods, listed below, all with their own shortcomings.
> Mailing-list archive and google searches have not revealed any
> solutions.

The numpy.histogram function can be modified to use memory much more  
efficiently when the input array is large, and the modification turns  
out to be faster even for smallish arrays (in my tests, anyway).   
Below is a modified version of the histogram function from  
function_base.py.  It is almost identical, but it avoids doing the  
sort of the entire input array simply by dividing it into blocks.   
(It would be even better to avoid the call to ravel too.)  The only  
other messy detail is that the builtin range function is shadowed by  
the 'range' parameter.

In my timing tests this is about the same speed for arrays about the  
same size as the block size and is faster than the current version by  
30-40% for large arrays.  The speed difference increases as the array  
size increases.

I haven't compared this to Eric's weave function, but this has the  
advantages of being pure Python and of being much simpler.  On my  
machine (MacBook Pro) it takes about 4 seconds for an array with 100  
million elements.  The time increases perfectly linearly with array  
size for arrays larger than a million elements.
					Rick

from numpy import *

lrange = range
def histogram(a, bins=10, range=None, normed=False):
     a = asarray(a).ravel()
     if not iterable(bins):
         if range is None:
             range = (a.min(), a.max())
         mn, mx = [mi+0.0 for mi in range]
         if mn == mx:
             mn -= 0.5
             mx += 0.5
         bins = linspace(mn, mx, bins, endpoint=False)

     # best block size probably depends on processor cache size
     block = 65536
     n = sort(a[:block]).searchsorted(bins)
     for i in lrange(block,len(a),block):
         n += sort(a[i:i+block]).searchsorted(bins)
     n = concatenate([n, [len(a)]])
     n = n[1:]-n[:-1]

     if normed:
         db = bins[1] - bins[0]
         return 1.0/(a.size*db) * n, bins
     else:
         return n, bins