[Numpy-discussion] Histograms of extremely large data sets

eric jones eric at enthought.com
Thu Dec 14 14:25:20 EST 2006



Rick White wrote:
> Just so we don't get too smug about the speed, if I do this in IDL on  
> the same machine it is 10 times faster (0.28 seconds instead of 4  
> seconds).  I'm sure the IDL version uses the much faster approach of  
> just sweeping through the array once, incrementing counts in the  
> appropriate bins.  It only handles equal-sized bins, so it is not as  
> general as the numpy version -- but equal-sized bins is a very common  
> case.  I'd still like to see a C version of histogram (which I guess  
> would need to be a ufunc) go into the core numpy.
>   
Yes, this gets rid of the search, and indices can just be calculated
from offsets.  I've attached a modified weaved histogram that takes this
approach.  Running the snippet below on my machine with 5 million
elements takes 0.118 sec for the evenly binned weave algorithm and
0.385 sec for Rick's algorithm.  That is a bit over 3x faster (but not
10x...), so there is indeed some speed to be gained for the common
special case.  I don't know if the code I wrote has a 2x gain left in
it, but I've spent zero time optimizing it.  I'd bet it can be improved
substantially.
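
For readers without the attachment, the offset idea can be sketched in
plain NumPy.  This is only an illustration, not the attached weave code:
the function name even_histogram is hypothetical, and dropping
out-of-range values is an assumption about how such a routine might
behave.

```python
import numpy as np

def even_histogram(data, bin_start, bin_size, bin_count):
    """Count values into bin_count equal-width bins starting at bin_start.

    Hypothetical helper for illustration; not the attached weave routine.
    """
    data = np.asarray(data).ravel()
    # The bin index comes straight from the offset, so no search over
    # bin edges is needed.
    idx = np.floor((data.astype(np.float64) - bin_start) / bin_size)
    idx = idx.astype(np.intp)
    # Assumption: values outside the binned range are silently dropped.
    in_range = (idx >= 0) & (idx < bin_count)
    # One pass over the indices, incrementing the matching bin.
    return np.bincount(idx[in_range], minlength=bin_count)

# Usage: compare against numpy.histogram on uint8 data.
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=1000, dtype=np.uint8)
counts = even_histogram(data, bin_start=0, bin_size=1, bin_count=256)
reference, _ = np.histogram(data, bins=np.arange(256 + 1))
assert np.array_equal(counts, reference)
```

The index arithmetic still builds a temporary array the size of the
data, so a compiled loop that increments the bins directly should do
better than this sketch.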

eric

### test_weave_even_histogram.py

from numpy import arange, product, sum, zeros, uint8
from numpy.random import randint

import weave_even_histogram

import time

shape = 1000,1000,5
size = product(shape)
data = randint(0,256,size).astype(uint8)
bins = arange(256+1)

print 'type:', data.dtype
print 'millions of elements:', size/1e6

# 256 unit-width bins starting at 0, one per possible uint8 value
bin_start = 0
bin_size = 1
bin_count = 256
t1 = time.clock()
res = weave_even_histogram.histogram(data, bin_start, bin_size, bin_count)
t2 = time.clock()
print 'sec (evenly spaced):', t2-t1, sum(res)
print res


-------------- next part --------------
A non-text attachment was scrubbed...
Name: weave_even_histogram.py
Type: text/x-python
Size: 1726 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20061214/bf302b4c/attachment.py>

