[Numpy-discussion] Histograms of extremely large data sets
Brian Granger
ellisonbg.net at gmail.com
Thu Dec 14 12:55:00 EST 2006
This same idea could be used to parallelize the histogram computation.
Then you could really get into large (many Gb/TB/PB) data sets. I
might try to find time to do this with ipython1, but someone else
could do this as well.
Brian
On 12/13/06, Rick White <rlw at stsci.edu> wrote:
> On Dec 12, 2006, at 10:27 PM, Cameron Walsh wrote:
>
> > I'm trying to generate histograms of extremely large datasets. I've
> > tried a few methods, listed below, all with their own shortcomings.
> > Mailing-list archive and google searches have not revealed any
> > solutions.
>
> The numpy.histogram function can be modified to use memory much more
> efficiently when the input array is large, and the modification turns
> out to be faster even for smallish arrays (in my tests, anyway).
> Below is a modified version of the histogram function from
> function_base.py. It is almost identical, but it avoids doing the
> sort of the entire input array simply by dividing it into blocks.
> (It would be even better to avoid the call to ravel too.) The only
> other messy detail is that the builtin range function is shadowed by
> the 'range' parameter.
>
> In my timing tests this is about the same speed for arrays about the
> same size as the block size and is faster than the current version by
> 30-40% for large arrays. The speed difference increases as the array
> size increases.
>
> I haven't compared this to Eric's weave function, but this has the
> advantages of being pure Python and of being much simpler. On my
> machine (MacBook Pro) it takes about 4 seconds for an array with 100
> million elements. The time increases perfectly linearly with array
> size for arrays larger than a million elements.
> Rick
>
> from numpy import *
>
> lrange = range
> def histogram(a, bins=10, range=None, normed=False):
> a = asarray(a).ravel()
> if not iterable(bins):
> if range is None:
> range = (a.min(), a.max())
> mn, mx = [mi+0.0 for mi in range]
> if mn == mx:
> mn -= 0.5
> mx += 0.5
> bins = linspace(mn, mx, bins, endpoint=False)
>
> # best block size probably depends on processor cache size
> block = 65536
> n = sort(a[:block]).searchsorted(bins)
> for i in lrange(block,len(a),block):
> n += sort(a[i:i+block]).searchsorted(bins)
> n = concatenate([n, [len(a)]])
> n = n[1:]-n[:-1]
>
> if normed:
> db = bins[1] - bins[0]
> return 1.0/(a.size*db) * n, bins
> else:
> return n, bins
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
More information about the NumPy-Discussion
mailing list