[Numpy-discussion] 2D binning

josef.pktd at gmail.com
Tue Jun 1 22:52:34 EDT 2010


On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus <zachary.pincus at yale.edu> wrote:
>> I guess it's as fast as I'm going to get. I don't really see any
>> other way. (BTW, the lat/lons are integers.)
>
> You could (in c or cython) try a brain-dead "hashtable" with no
> collision detection:
>
> for lat, lon, value in dataset:
>   idx = (lat ^ lon) % num_bins
>   hashtable[idx] = update_incremental_mean(hashtable[idx], value)
>
> you'll of course want to do some experiments to see if your data are
> sufficiently sparse and/or you can afford a large enough hashtable
> array that you won't get spurious hash collisions. Adding error-
> checking to ensure that there are no collisions would be pretty
> trivial (just keep a table of the lat/long for each hash value, which
> you'll need anyway, and check that different lat/long pairs don't get
> assigned the same bin).
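>
> (A rough pure-Python sketch to make that concrete --
> update_incremental_mean above is pseudocode; here the table is a pair
> of NumPy arrays, num_bins is an arbitrary size you'd tune, and lat,
> lon, data are the integer lists from the original question:)
>
> import numpy as np
>
> num_bins = 2 ** 20
> counts = np.zeros(num_bins, dtype=np.int64)
> means = np.zeros(num_bins, dtype=np.float64)
>
> for lt, ln, value in zip(lat, lon, data):
>     idx = (lt ^ ln) % num_bins    # brain-dead hash, no collision check
>     counts[idx] += 1
>     # incremental mean update: m_n = m_{n-1} + (x - m_{n-1}) / n
>     means[idx] += (value - means[idx]) / counts[idx]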
>
> Zach
>
>
>
>> -Mathew
>>
>> On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus <zachary.pincus at yale.edu
>> > wrote:
>> > Hi
>> > Can anyone think of a clever (non-looping) solution to the
>> > following?
>> >
>> > I have a list of latitudes, a list of longitudes, and a list of
>> > data values. All lists are the same length.
>> >
>> > I want to compute an average of the data values for each lat/lon
>> > pair, e.g. if lat[1001] == lat[2001] and lon[1001] == lon[2001],
>> > then data[1001] = (data[1001] + data[2001])/2
>> >
>> > Looping is going to take way too long.
>>
>> As a start, are the "equal" lat/lon pairs exactly equal (i.e. either
>> not floating-point, or floats that will always compare equal, that is,
>> the floating-point bit-patterns will be guaranteed to be identical) or
>> approximately equal to float tolerance?
>>
>> If you're in the approx-equal case, then look at the KD-tree in scipy
>> for doing near-neighbors queries.
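>>
>> (A minimal sketch of that approach, assuming scipy.spatial.cKDTree;
>> the 1e-6 tolerance is a placeholder you'd tune to your data:)
>>
>> import numpy as np
>> from scipy.spatial import cKDTree
>>
>> pts = np.column_stack([lat, lon]).astype(float)
>> vals = np.asarray(data, dtype=float)
>> tree = cKDTree(pts)
>> # for each point, indices of all points within tolerance (incl. itself)
>> neighbors = tree.query_ball_point(pts, r=1e-6)
>> averaged = np.array([vals[idx].mean() for idx in neighbors])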
>>
>> If you're in the exact-equal case, you could consider hashing the lat/
>> lon pairs or something. At least then the looping is O(N) and not
>> O(N^2):
>>
>> import collections
>> import numpy
>>
>> grouped = collections.defaultdict(list)
>> for lt, ln, da in zip(lat, lon, data):
>>     grouped[(lt, ln)].append(da)
>>
>> averaged = dict((ltln, numpy.mean(das))
>>                 for ltln, das in grouped.items())
>>
>> Is that fast enough?

If the lat/lon pairs can be converted to a 1d label, as Wes suggested,
then in a similar timing exercise ndimage was the fastest:
http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html

(This was for Python 2.4; later I also found np.bincount, which
requires that the labels be consecutive integers but is as fast as
ndimage.)
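
(For reference, a minimal sketch of that route -- this assumes integer
lat/lon so each pair can be packed into a single nonnegative 1d label:)

import numpy as np
from scipy import ndimage

lat = np.asarray(lat)
lon = np.asarray(lon)
vals = np.asarray(data, dtype=float)

# pack each (lat, lon) pair into one integer label
width = lon.max() - lon.min() + 1
labels = (lat - lat.min()) * width + (lon - lon.min())

# per-label means via bincount (labels must be nonnegative ints)
sums = np.bincount(labels, weights=vals)
counts = np.bincount(labels)
means = sums[counts > 0] / counts[counts > 0]

# or equivalently with ndimage
means2 = ndimage.mean(vals, labels=labels, index=np.unique(labels))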

I don't know how it would compare to the new suggestions.

Josef



>>
>> Zach


