[Numpy-discussion] efficient 3d histogram creation

Tue May 5 09:46:38 EDT 2009

On Mon, May 4, 2009 at 4:18 PM, <josef.pktd at gmail.com> wrote:

> On Mon, May 4, 2009 at 4:00 PM, Chris Colbert <sccolbert at gmail.com> wrote:
> > i'll take a look at them over the next few days and see what i can hack
> out.
> >
> > Chris
> >
> > On Mon, May 4, 2009 at 3:18 PM, David Huard <david.huard at gmail.com>
> wrote:
> >>
> >>
> >> On Mon, May 4, 2009 at 7:00 AM, <josef.pktd at gmail.com> wrote:
> >>>
> >>> On Mon, May 4, 2009 at 12:31 AM, Chris Colbert <sccolbert at gmail.com>
> >>> wrote:
> >>> > this actually sort of worked. Thanks for putting me on the right
> track.
> >>> >
> >>> > Here is what I ended up with.
> >>> >
> >>> > this is what I ended up with:
> >>> >
> >>> > def hist3d(imgarray):
> >>> >     histarray = N.zeros((16, 16, 16))
> >>> >     temp = imgarray.copy()
> >>> >     bins = N.arange(0, 257, 16)
> >>> >     histarray = N.histogramdd((temp[:,:,0].ravel(),
> >>> > temp[:,:,1].ravel(),
> >>> > temp[:,:,2].ravel()), bins=(bins, bins, bins))[0]
> >>> >     return histarray
> >>> >
> >>> > this creates a 3d histogram of rgb image values in the range 0,255
> >>> > using 16
> >>> > bins per component color.
> >>> >
> >>> > on a 640x480 image, it executes in 0.3 seconds vs 4.5 seconds for a
> for
> >>> > loop.
> >>> >
> >>> > not quite framerate, but good enough for prototyping.
> >>> >
> >>>
> >>> I don't think your copy to temp is necessary, and use reshape(-1,3) as
> >>> in the example of Stefan, which will avoid copying the array 3 times.
> >>>
> >>> If you need to gain some more speed, then rewriting histogramdd and
> >>> removing some of the unnecessary checks and calculations looks
> >>> possible.
> >>
> >> Indeed, the strategy used in the histogram function is faster than the
> one
> >> used in the histogramdd case, so porting one to the other should speed
> >> things up.
> >>
> >> David
>
> is searchsorted faster than digitize and bincount ?
>

That depends on the number of bins and whether or not the bin width is
uniform. A 1D benchmark I did a while ago showed that if the bin width is
uniform, then the best strategy is to create a counter initialized to 0,
loop through the data, compute i = (x-bin0) /binwidth and increment counter
i by 1 (or by the weight of the data). If the bins are non uniform, then for
nbin > 30 you'd better use searchsort, and digitize otherwise.

For those interested in speeding up histogram code, I recommend reading a
thread started by Cameron Walsh on the 12/12/06 named "Histograms of
extremely large data sets" Code and benchmarks were posted.

Chris, if your bins all have the same width, then you can certainly write an
histogramdd routine that is way faster by using the indexing trick instead
of digitize or searchsort.

Cheers,

David

>
> Using the idea of histogramdd, I get a bit below a tenth of a second,
> my best for this problem is below.
> I was trying for a while what the fastest way is to convert a two
> dimensional array into a one dimensional index for bincount. I found
> that using the return index of unique1d is very slow compared to
> numeric index calculation.
>
> Josef
>
> example timed for:
> nobs = 307200
> nbins = 16
> factors = np.random.randint(256,size=(nobs,3)).copy()
> factors2 = factors.reshape(-1,480,3).copy()
>
> def hist3(factorsin, nbins):
>    if factorsin.ndim != 2:
>        factors = factorsin.reshape(-1,factorsin.shape[-1])
>    else:
>        factors = factorsin
>    N, D = factors.shape
>    darr = np.empty(factors.T.shape, dtype=int)
>    nele = np.max(factors)+1
>    bins = np.arange(0, nele, nele/nbins)
>    bins[-1] += 1
>    for i in range(D):
>        darr[i] = np.digitize(factors[:,i],bins) - 1
>
>    #add weighted rows
>    darrind = darr[D-1]
>    for i in range(D-1):
>        darrind += darr[i]*nbins**(D-i-1)
>    return np.bincount(darrind)  # return flat not reshaped
>

> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20090505/15389e1b/attachment.html>