[Numpy-discussion] Histograms of extremely large data sets

David Huard david.huard at gmail.com
Thu Dec 14 14:45:24 EST 2006


Hi,

I spent some time a while ago on a histogram function for numpy. It uses
digitize and bincount instead of sorting the data. If I remember right, it
was significantly faster than numpy's histogram, but I don't know how it
will behave with very large data sets.
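
For readers who do not want to open the attachment, here is a minimal sketch
of the digitize/bincount idea. It is not the attached histogram1d.py: the
name histogram_digitize, the half-open bins, and the choice to drop
out-of-range values are all assumptions made for illustration.

```python
import numpy as np

def histogram_digitize(data, edges):
    """Count values per bin with digitize + bincount (hypothetical helper).

    Assumes `edges` is increasing. Bins are half-open,
    [edges[i], edges[i+1]), and values outside the edges are dropped.
    """
    data = np.asarray(data).ravel()
    edges = np.asarray(edges)
    nbins = len(edges) - 1
    # digitize returns i such that edges[i-1] <= x < edges[i];
    # 0 means below the first edge, len(edges) means at or above the last.
    idx = np.digitize(data, edges)
    # One counting pass over the indices; no sort of the data is needed.
    counts = np.bincount(idx, minlength=len(edges) + 1)
    # Discard the two outlier slots (index 0 and index len(edges)).
    return counts[1:nbins + 1]
```

Note that, unlike numpy's own histogram, this sketch does not count a value
equal to the last edge in the final bin.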

I attached the file if you want to take a look, or if you send me the
benchmark, I'll add my function to it and report the results.

Cheers,

David

2006/12/14, eric jones <eric at enthought.com>:
>
> Rick White wrote:
> > Just so we don't get too smug about the speed, if I do this in IDL on
> > the same machine it is 10 times faster (0.28 seconds instead of 4
> > seconds).  I'm sure the IDL version uses the much faster approach of
> > just sweeping through the array once, incrementing counts in the
> > appropriate bins.  It only handles equal-sized bins, so it is not as
> > general as the numpy version -- but equal-sized bins is a very common
> > case.  I'd still like to see a C version of histogram (which I guess
> > would need to be a ufunc) go into the core numpy.
> >
> Yes, this gets rid of the search, and indices can just be calculated
> from offsets.  I've attached a modified weaved histogram that takes this
> approach.  Running the snippet below on my machine takes 0.118 sec for
> the evenly binned weave algorithm and 0.385 sec for Rick's algorithm on
> 5 million elements.  That is close to 4x faster (but not 10x...), so
> there is indeed some speed to be gained for the common special case.  I
> don't know if the code I wrote has a 2x gain left in it, but I've spent
> zero time optimizing it.  I'd bet it can be improved substantially.
>
> eric
>
> ### test_weave_even_histogram.py
>
> from numpy import arange, product, sum, zeros, uint8
> from numpy.random import randint
>
> import weave_even_histogram
>
> import time
>
> shape = 1000,1000,5
> size = product(shape)
> data = randint(0,256,size).astype(uint8)
> bins = arange(256+1)
>
> print 'type:', data.dtype
> print 'millions of elements:', size/1e6
>
> bin_start = 0
> bin_size = 1
> bin_count = 256
> t1 = time.clock()
> res = weave_even_histogram.histogram(data, bin_start, bin_size, bin_count)
> t2 = time.clock()
> print 'sec (evenly spaced):', t2-t1, sum(res)
> print res
>
>
> >                                       Rick
> > _______________________________________________
> > Numpy-discussion mailing list
> > Numpy-discussion at scipy.org
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion
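
The equal-width shortcut discussed in the quoted messages (one sweep through
the array, with each bin index computed from an offset rather than a search)
can be sketched in plain numpy. This is only an illustration, not eric's weave
code: the function name histogram_even is hypothetical, the parameters simply
mirror the bin_start, bin_size and bin_count arguments in the test script, and
the data is assumed to be finite.

```python
import numpy as np

def histogram_even(data, bin_start, bin_size, bin_count):
    """Histogram for equal-width bins (hypothetical helper, finite data assumed)."""
    # Work in float64 so the subtraction cannot wrap around for small
    # unsigned integer types such as uint8.
    data = np.asarray(data).ravel().astype(np.float64)
    # The bin index follows directly from the offset; no search over edges.
    idx = np.floor((data - bin_start) / bin_size).astype(np.intp)
    # Keep only values inside [bin_start, bin_start + bin_size * bin_count).
    inside = (idx >= 0) & (idx < bin_count)
    return np.bincount(idx[inside], minlength=bin_count)
```

A call such as histogram_even(data, 0, 1, 256) corresponds to the uint8 case
in the test script. The float64 copy costs memory on very large arrays, which
is one reason a compiled single-pass loop remains attractive.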
-------------- next part --------------
A non-text attachment was scrubbed...
Name: histogram1d.py
Type: text/x-python
Size: 6826 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20061214/08f3fbd1/attachment.py>