[Numpy-discussion] Histograms of extremely large data sets
Cameron Walsh
cameron.walsh at gmail.com
Tue Dec 12 22:27:34 EST 2006
Hi all,
I'm trying to generate histograms of extremely large datasets. I've
tried a few methods, listed below, all with their own shortcomings.
Searches of the mailing-list archive and Google have not turned up any
solutions.
Method 1:
import numpy
import pylab
data=numpy.empty((489,1000,1000),dtype="uint8")
# Replace this line with actual data samples, but the size and types are correct.
histogram = pylab.hist(data, bins=range(0,256))
pylab.xlim(0,256)
pylab.show()
The problem with this method is that it appears never to finish. It is,
however, extremely fast for smaller data sets: a 5x1000x1000 array takes
1-2 seconds, whereas the roughly 500x1000x1000 case never seems to complete.
Method 2:
import numpy
import pylab
data=numpy.empty((489,1000,1000),dtype="uint8")
# Replace this line with actual data samples, but the size and types are correct.
bins=numpy.zeros((256),dtype="uint32")
for val in data.flat:
    bins[val]+=1
barchart = pylab.bar(range(256),bins,align="center")
pylab.xlim(0,256)
pylab.show()
The problem with this method is that it is incredibly slow, taking up to 30
seconds for a 1x1000x1000 sample. I have neither the patience nor the
inclination to time a 500x1000x1000 sample.
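For comparison, the per-element counting loop in Method 2 can be sketched in vectorized form. This is only a minimal sketch, assuming a current NumPy in which numpy.bincount accepts a minlength argument; the function name byte_histogram and the chunk_planes parameter are hypothetical names chosen for illustration, and the small random array stands in for the real data.

```python
import numpy as np

def byte_histogram(data, chunk_planes=16):
    """Count occurrences of each uint8 value, one slab of planes at a time.

    chunk_planes is a hypothetical tuning knob: it bounds how much of the
    array is handed to bincount in a single call.
    """
    counts = np.zeros(256, dtype=np.int64)
    for start in range(0, data.shape[0], chunk_planes):
        slab = data[start:start + chunk_planes]
        # bincount tallies each integer value; minlength keeps 256 slots
        # even when the highest values are absent from this slab.
        counts += np.bincount(slab.ravel(), minlength=256)
    return counts

# Small stand-in for the (489, 1000, 1000) array described above.
rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(5, 100, 100), dtype=np.uint8)
counts = byte_histogram(sample)
```

The resulting counts array has the same meaning as bins in Method 2, so it could be passed to pylab.bar in the same way.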
Method 3:
import numpy
data=numpy.empty((489,1000,1000),dtype="uint8")
# Replace this line with actual data samples, but the size and types are correct.
a=numpy.histogram(data,256)
The problem with this one is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.5/site-packages/numpy/lib/function_base.py", line 96, in histogram
    n = sort(a).searchsorted(bins)
ValueError: dimensions too large.
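The traceback points at a sort over the whole array inside numpy.histogram. One way to keep each call small is to histogram the data one slab at a time against fixed bin edges and sum the per-slab counts. The following is a rough sketch under stated assumptions: it assumes a current NumPy in which numpy.histogram returns a (counts, edges) pair, and the names chunked_histogram and chunk_planes are hypothetical. Whether this avoids the error in the NumPy version shown in the traceback has not been checked.

```python
import numpy as np

def chunked_histogram(data, edges, chunk_planes=16):
    """Sum per-slab histograms computed against the same fixed bin edges."""
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    for start in range(0, data.shape[0], chunk_planes):
        # Using identical edges for every slab makes the counts additive.
        slab_counts, _ = np.histogram(data[start:start + chunk_planes], bins=edges)
        counts += slab_counts
    return counts

# 256 unit-width bins covering the uint8 range 0..255.
edges = np.arange(257)

# Small stand-in for the (489, 1000, 1000) array described above.
rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(5, 100, 100), dtype=np.uint8)
counts = chunked_histogram(sample, edges)
```

Passing explicit edges rather than a bin count matters here: with a bin count, each slab could be given different edges derived from its own minimum and maximum, and the per-slab results could not be added together.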
It seems that iterating over the entire array and doing it manually is
the slowest possible method, but that the rest are not much better.
Is there a faster method available, or do I have to implement method 2
in C and submit the change as a patch?
Thanks and best regards,
Cameron.