[Numpy-discussion] Histograms of extremely large data sets

Brian Granger ellisonbg.net at gmail.com
Thu Dec 14 12:55:00 EST 2006


This same idea could be used to parallelize the histogram computation.
Then you could really get into large (many GB/TB/PB) data sets.  I
might try to find time to do this with ipython1, but someone else
could do this as well.
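
To make the idea concrete, here is a rough sketch using a pool of
worker processes (illustrative only: the helper names and worker count
are made up, and ipython1 would distribute the blocks differently):

import numpy
from multiprocessing import Pool

def _block_counts(args):
    block, edges = args
    # per-block partial result: for each bin edge, the number of
    # elements in this block that sort below it
    return numpy.sort(block).searchsorted(edges)

def parallel_histogram(a, edges, block=65536, nworkers=4):
    # guard calls with "if __name__ == '__main__':" when running as a
    # script, since Pool re-imports the module in the workers
    blocks = [a[i:i + block] for i in range(0, len(a), block)]
    pool = Pool(nworkers)
    partial = pool.map(_block_counts, [(b, edges) for b in blocks])
    pool.close()
    # the per-block counts combine by simple addition
    n = numpy.sum(partial, axis=0)
    # differences of the cumulative counts give the per-bin counts,
    # exactly as in Rick's blocked version below
    n = numpy.concatenate([n, [len(a)]])
    return n[1:] - n[:-1]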

Brian

On 12/13/06, Rick White <rlw at stsci.edu> wrote:
> On Dec 12, 2006, at 10:27 PM, Cameron Walsh wrote:
>
> > I'm trying to generate histograms of extremely large datasets.  I've
> > tried a few methods, listed below, all with their own shortcomings.
> > Mailing-list archive and Google searches have not revealed any
> > solutions.
>
> The numpy.histogram function can be modified to use memory much more
> efficiently when the input array is large, and the modification turns
> out to be faster even for smallish arrays (in my tests, anyway).
> Below is a modified version of the histogram function from
> function_base.py.  It is almost identical, but it avoids sorting the
> entire input array at once by processing it in blocks.  (It would be
> even better to avoid the call to ravel too.)  The only other messy
> detail is that the built-in range function is shadowed by the 'range'
> parameter.
>
> In my timing tests this is about the same speed as the current
> version for arrays comparable in size to the block, and is 30-40%
> faster for large arrays.  The speed advantage grows as the array
> size increases.
>
> I haven't compared this to Eric's weave function, but this has the
> advantages of being pure Python and of being much simpler.  On my
> machine (a MacBook Pro) it takes about 4 seconds for an array with
> 100 million elements.  The time increases linearly with array size
> for arrays larger than about a million elements.
>                                         Rick
>
> from numpy import *
>
> lrange = range    # keep a reference to the builtin range, since the
>                   # 'range' argument below shadows it
> def histogram(a, bins=10, range=None, normed=False):
>     a = asarray(a).ravel()
>     if not iterable(bins):
>         # scalar bins: construct equally spaced bin edges, using the
>         # data limits if no range is given
>         if range is None:
>             range = (a.min(), a.max())
>         mn, mx = [mi + 0.0 for mi in range]
>         if mn == mx:
>             mn -= 0.5
>             mx += 0.5
>         bins = linspace(mn, mx, bins, endpoint=False)
>
>     # best block size probably depends on processor cache size
>     block = 65536
>     # sort one block at a time; searchsorted counts, for each bin
>     # edge, the elements that sort below it, and those counts simply
>     # add across blocks
>     n = sort(a[:block]).searchsorted(bins)
>     for i in lrange(block, len(a), block):
>         n += sort(a[i:i + block]).searchsorted(bins)
>     # differences of the cumulative counts give the per-bin counts
>     n = concatenate([n, [len(a)]])
>     n = n[1:] - n[:-1]
>
>     if normed:
>         db = bins[1] - bins[0]
>         return 1.0 / (a.size * db) * n, bins
>     else:
>         return n, bins
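>
> A quick way to exercise it (a minimal sketch; the data and names here
> are just for illustration):
>
> import numpy
> data = numpy.random.normal(size=10**7)
> counts, edges = histogram(data, bins=100)
> # with the default range the bins cover every value, so the counts
> # should account for each element
> assert counts.sum() == data.size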
>


