[Numpy-discussion] Histograms of extremely large data sets

Thu Dec 14 14:27:35 EST 2006

I just noticed a bug in this code.  "PyArray_ITER_NEXT(iter);" should be moved out of the if statement.

eric

eric jones wrote:
>
>
> Rick White wrote:
>> Just so we don't get too smug about the speed, if I do this in IDL 
>> on  the same machine it is 10 times faster (0.28 seconds instead of 
>> 4  seconds).  I'm sure the IDL version uses the much faster approach 
>> of  just sweeping through the array once, incrementing counts in the  
>> appropriate bins.  It only handles equal-sized bins, so it is not as  
>> general as the numpy version -- but equal-sized bins is a very 
>> common  case.  I'd still like to see a C version of histogram (which 
>> I guess  would need to be a ufunc) go into the core numpy.
>>   
> Yes, this gets rid of the search, and indices can just be caluclated 
> from offsets.  I've attached a modified weaved histogram that takes 
> this approach.  Running the snippet below on my machine takes .118 sec 
> for the evenly binned weave algorithm and 0.385 sec for Rick's 
> algorithm on 5 million elements.  That is close to 4x  faster (but not 
> 10x...), so there is indeed some speed to be gained for the common 
> special case.  I don't know if the code I wrote has a 2x gain left in 
> it, but I've spent zero time optimizing it.  I'd bet it can be 
> improved substantially.
>
> eric
>
> ### test_weave_even_histogram.py
>
> from numpy import arange, product, sum, zeros, uint8
> from numpy.random import randint
>
> import weave_even_histogram
>
> import time
>
> shape = 1000,1000,5
> size = product(shape)
> data = randint(0,256,size).astype(uint8)
> bins = arange(256+1)
>
> print 'type:', data.dtype
> print 'millions of elements:', size/1e6
>
> bin_start = 0
> bin_size = 1
> bin_count = 256
> t1 = time.clock()
> res = weave_even_histogram.histogram(data, bin_start, bin_size, 
> bin_count)
> t2 = time.clock()
> print 'sec (evenly spaced):', t2-t1, sum(res)
> print res
>
>
>>                     Rick
>> _______________________________________________
>> Numpy-discussion mailing list
>> Numpy-discussion at scipy.org
>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>>
>>   
>
> ------------------------------------------------------------------------
>
> from numpy import array, zeros, asarray, sort, int32
> from scipy import weave
> from typed_array_converter import converters
>
> def histogram(ary, bin_start, bin_size, bin_count):
>     
>     ary = asarray(ary)
>     
>     # Create an array to hold the histogram count results.
>     results = zeros(bin_count,dtype=int32)
>     
>     # The C++ code that actually does the histogramming.    
>     code = """
>            PyArrayIterObject *iter = (PyArrayIterObject*)PyArray_IterNew(py_ary);
>            
>            while(iter->index < iter->size)
>            {
>            
>                //////////////////////////////////////////////////////////
>                // binary search
>                //////////////////////////////////////////////////////////
>                
>                // This requires an update to weave 
>                ary_data_type value = *((ary_data_type*)iter->dataptr);
>                if (value>=bin_start)
>                {
>                    int bin_index = (int)((value-bin_start)/bin_size);
>                
>                    //////////////////////////////////////////////////////////
>                    // Bin counter increment
>                    //////////////////////////////////////////////////////////
>     
>                    // If the value was found, increment the counter for that bin.
>                    if (bin_index < bin_count)
>                    {
>                        results[bin_index]++;
>                    }    
>                    PyArray_ITER_NEXT(iter);
>                }    
>            }
>            """
>     weave.inline(code, ['ary', 'bin_start', 'bin_size','bin_count', 'results'], 
>                  type_converters=converters, 
>                  compiler='gcc')
>                  
>     return results
>   
> ------------------------------------------------------------------------
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>