[Numpy-discussion] Histograms of extremely large data sets

eric jones eric at enthought.com
Thu Dec 14 01:03:45 EST 2006


Looks to me like Rick's version is simpler and faster.  It offers a
speed-up of about 1.6x on my machine over the weave version.  I
believe this is because the sorting approach results in far fewer
comparisons than the algorithm I used.
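For comparison, here is a rough pure-Python sketch of the kind of
direct per-element binning a compiled (weave) histogram typically
does; it is only an illustrative stand-in, not the actual weave code:

import numpy as np

def histogram_loop(a, edges):
    # Direct binning: each element is placed with its own binary
    # search over the bin edges.
    a = np.asarray(a).ravel()
    counts = np.zeros(len(edges), dtype=int)
    for x in a:
        # rightmost edge <= x; values below edges[0] are dropped,
        # values at or above edges[-1] land in the last bin
        i = np.searchsorted(edges, x, side='right') - 1
        if i >= 0:
            counts[i] += 1
    return counts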

Very cool.  I vote that his version go into numpy.

eric



Rick White wrote:
> On Dec 12, 2006, at 10:27 PM, Cameron Walsh wrote:
>
>   
>> I'm trying to generate histograms of extremely large datasets.  I've
>> tried a few methods, listed below, all with their own shortcomings.
>> Mailing-list archive and google searches have not revealed any
>> solutions.
>>     
>
> The numpy.histogram function can be modified to use memory much more  
> efficiently when the input array is large, and the modification turns  
> out to be faster even for smallish arrays (in my tests, anyway).   
> Below is a modified version of the histogram function from  
> function_base.py.  It is almost identical, but it avoids sorting  
> the entire input array at once by dividing it into blocks.   
> (It would be even better to avoid the call to ravel too.)  The only  
> other messy detail is that the builtin range function is shadowed by  
> the 'range' parameter.
>
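To make the trick concrete: searchsorted on a sorted block counts the
elements below each bin edge, and those counts are additive across
blocks, so they can be summed and then differenced into per-bin
counts.  A tiny example (not from Rick's post, just an illustration):

import numpy as np

a = np.array([3.0, 7.0, 1.0, 4.0, 9.0, 4.0])
edges = np.array([0.0, 2.5, 5.0, 7.5])

# searchsorted on sorted data counts the elements below each edge
cum = np.sort(a).searchsorted(edges)
print(cum)                  # [0 1 4 5]: e.g. 4 elements are < 5.0

# append the total and difference to get per-bin counts;
# the last bin collects everything at or above the last edge
cum = np.concatenate([cum, [len(a)]])
print(cum[1:] - cum[:-1])   # [1 3 1 1]
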
> In my timing tests this is about the same speed as the current  
> version for arrays comparable to the block size, and 30-40% faster  
> for large arrays.  The speed advantage grows as the array size  
> increases.
>
> I haven't compared this to Eric's weave function, but this has the  
> advantages of being pure Python and of being much simpler.  On my  
> machine (MacBook Pro) it takes about 4 seconds for an array with 100  
> million elements.  The time increases perfectly linearly with array  
> size for arrays larger than a million elements.
> 					Rick
>
> from numpy import *
>
> lrange = range   # save the builtin; the 'range' argument shadows it
> def histogram(a, bins=10, range=None, normed=False):
>     a = asarray(a).ravel()
>     if not iterable(bins):
>         if range is None:
>             range = (a.min(), a.max())
>         mn, mx = [mi + 0.0 for mi in range]
>         if mn == mx:
>             mn -= 0.5
>             mx += 0.5
>         # bin lower edges, following the old histogram convention
>         bins = linspace(mn, mx, bins, endpoint=False)
>
>     # best block size probably depends on processor cache size
>     block = 65536
>     # cumulative counts: number of elements below each bin edge,
>     # accumulated one sorted block at a time
>     n = sort(a[:block]).searchsorted(bins)
>     for i in lrange(block, len(a), block):
>         n += sort(a[i:i+block]).searchsorted(bins)
>     # difference the cumulative counts to get per-bin counts;
>     # the last bin collects everything at or above the last edge
>     n = concatenate([n, [len(a)]])
>     n = n[1:] - n[:-1]
>
>     if normed:
>         # normalize so the histogram integrates to 1 (uniform bins)
>         db = bins[1] - bins[0]
>         return 1.0/(a.size*db) * n, bins
>     else:
>         return n, bins
>
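
A quick way to exercise the function above and check the linear
scaling Rick describes (assuming histogram as quoted is defined; the
sizes are just for illustration, and Rick's 100-million-element test
needs roughly 800 MB for float64 data):

import time
import numpy as np

for n in (10**6, 10**7):
    data = np.random.random(n)
    t0 = time.time()
    counts, bins = histogram(data, bins=100)
    print('%8d elements: %.2f s' % (n, time.time() - t0))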



