[Numpy-discussion] Histograms of extremely large data sets

Cameron Walsh cameron.walsh at gmail.com
Tue Dec 12 22:27:34 EST 2006


Hi all,

I'm trying to generate histograms of extremely large datasets.  I've
tried a few methods, listed below, all with their own shortcomings.
Mailing-list archive and Google searches have not revealed any
solutions.

Method 1:

import numpy
import pylab

data=numpy.empty((489,1000,1000),dtype="uint8")
# Replace this line with actual data samples, but the size and types
# are correct.

histogram = pylab.hist(data, bins=range(0,256))
pylab.xlim(0,256)
pylab.show()

The problem with this method is that it appears never to finish.  It
is, however, extremely fast for smaller data sets, like 5x1000x1000
(1-2 seconds) instead of 500x1000x1000.


Method 2:

import numpy
import pylab

data=numpy.empty((489,1000,1000),dtype="uint8")
# Replace this line with actual data samples, but the size and types
# are correct.

# One counter per possible uint8 value.
bins=numpy.zeros((256),dtype="uint32")
for val in data.flat:
    bins[val]+=1
barchart = pylab.bar(range(256),bins,align="center")
pylab.xlim(0,256)
pylab.show()

The problem with this method is that it is incredibly slow, taking up
to 30 seconds for a 1x1000x1000 sample.  I have neither the patience
nor the inclination to time a 500x1000x1000 sample.
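
The per-element loop above could, as a rough sketch, be replaced by
numpy.bincount, which does the same counting in compiled code.  This
assumes a NumPy version whose bincount accepts the minlength argument;
the function name count_uint8_values and the chunk_size parameter are
made up for illustration, not part of NumPy.

```python
import numpy

def count_uint8_values(data, chunk_size=1 << 20):
    """Count how often each value 0..255 occurs in a uint8 array.

    Hypothetical helper: processes the flattened array in chunks so
    that any temporary bincount creates stays small.
    """
    flat = data.reshape(-1)  # a view for contiguous arrays, else a copy
    counts = numpy.zeros(256, dtype=numpy.int64)
    for start in range(0, flat.size, chunk_size):
        chunk = flat[start:start + chunk_size]
        # minlength=256 gives a full-length result even if some
        # values are absent from this chunk.
        counts += numpy.bincount(chunk, minlength=256)
    return counts

# Small random stand-in for the real data (assumption: uint8 samples).
sample = numpy.random.randint(0, 256, size=(5, 100, 100)).astype(numpy.uint8)
bins = count_uint8_values(sample)
```

The resulting bins array can be passed to pylab.bar in the same way as
in Method 2.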


Method 3:

import numpy

data=numpy.empty((489,1000,1000),dtype="uint8")
# Replace this line with actual data samples, but the size and types
# are correct.

a=numpy.histogram(data,256)


The problem with this one is:

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/usr/local/lib/python2.5/site-packages/numpy/lib/function_base.py", line 96, in histogram
   n = sort(a).searchsorted(bins)
ValueError: dimensions too large.
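
Since the failure comes from sorting the whole array at once, one
possible workaround (a sketch only, not tested on the full
489x1000x1000 data) is to call numpy.histogram on one slab at a time
with fixed bin edges and add up the counts.  The helper name
histogram_by_slab is hypothetical.

```python
import numpy

def histogram_by_slab(data, edges):
    """Sum per-slab histograms along the first axis of data.

    Hypothetical helper: every call uses the same explicit bin edges,
    so the per-slab counts can simply be added together.
    """
    totals = numpy.zeros(len(edges) - 1, dtype=numpy.int64)
    for slab in data:  # iterates over the first axis
        counts, _ = numpy.histogram(slab, bins=edges)
        totals += counts
    return totals

# 257 edges give 256 bins, one per uint8 value.
edges = numpy.arange(257)

# Small random stand-in for the real data (assumption: uint8 samples).
sample = numpy.random.randint(0, 256, size=(4, 50, 50)).astype(numpy.uint8)
totals = histogram_by_slab(sample, edges)
```

Only one slab is sorted or binned at a time, so the working memory is
bounded by the slab size rather than by the whole data set.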


It seems that iterating over the entire array and doing it manually is
the slowest possible method, but that the rest are not much better.
Is there a faster method available, or do I have to implement method 2
in C and submit the change as a patch?

Thanks and best regards,

Cameron.


