[Numpy-discussion] 2D binning

Wes McKinney wesmckinn at gmail.com
Wed Jun 2 13:45:58 EDT 2010


On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates <mat.yeates at gmail.com> wrote:
> thanks. I am also getting an error in ndi.mean
> Were you getting the error
> "RuntimeError: data type not supported"?
>
> -Mathew
>
> On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>
>> On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut <schut at sarvision.nl> wrote:
>> > On 06/02/2010 04:52 AM, josef.pktd at gmail.com wrote:
>> >> On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus<zachary.pincus at yale.edu>
>> >>  wrote:
>> >>>> I guess it's as fast as I'm going to get. I don't really see any
>> >>>> other way. BTW, the lat/lons are integers)
>> >>>
>> >>> You could (in c or cython) try a brain-dead "hashtable" with no
>> >>> collision detection:
>> >>>
>> >>> for lat, long, data in dataset:
>> >>>    bin = (lat ^ long) % num_bins
>> >>>    hashtable[bin] = update_incremental_mean(hashtable[bin], data)
>> >>>
>> >>> you'll of course want to do some experiments to see if your data are
>> >>> sufficiently sparse and/or you can afford a large enough hashtable
>> >>> array that you won't get spurious hash collisions. Adding error-
>> >>> checking to ensure that there are no collisions would be pretty
>> >>> trivial (just keep a table of the lat/long for each hash value, which
>> >>> you'll need anyway, and check that different lat/long pairs don't get
>> >>> assigned the same bin).
>> >>>
>> >>> Zach
>> >>>
>> >>>
>> >>>
>> >>>> -Mathew
>> >>>>
>> >>>> On Tue, Jun 1, 2010 at 1:49 PM, Zachary
>> >>>> Pincus<zachary.pincus at yale.edu
>> >>>>> wrote:
>> >>>>> Hi
>> >>>>> Can anyone think of a clever (non-lopping) solution to the
>> >>>> following?
>> >>>>>
>> >>>>> A have a list of latitudes, a list of longitudes, and list of data
>> >>>>> values. All lists are the same length.
>> >>>>>
>> >>>>> I want to compute an average  of data values for each lat/lon pair.
>> >>>>> e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
>> >>>>> data[1001] = (data[1001] + data[2001])/2
>> >>>>>
>> >>>>> Looping is going to take wayyyy to long.
>> >>>>
>> >>>> As a start, are the "equal" lat/lon pairs exactly equal (i.e. either
>> >>>> not floating-point, or floats that will always compare equal, that
>> >>>> is,
>> >>>> the floating-point bit-patterns will be guaranteed to be identical)
>> >>>> or
>> >>>> approximately equal to float tolerance?
>> >>>>
>> >>>> If you're in the approx-equal case, then look at the KD-tree in scipy
>> >>>> for doing near-neighbors queries.
>> >>>>
>> >>>> If you're in the exact-equal case, you could consider hashing the
>> >>>> lat/
>> >>>> lon pairs or something. At least then the looping is O(N) and not
>> >>>> O(N^2):
>> >>>>
>> >>>> import collections
>> >>>> grouped = collections.defaultdict(list)
>> >>>> for lt, ln, da in zip(lat, lon, data):
>> >>>>    grouped[(lt, ln)].append(da)
>> >>>>
>> >>>> averaged = dict((ltln, numpy.mean(da)) for ltln, da in
>> >>>> grouped.items())
>> >>>>
>> >>>> Is that fast enough?
>> >>
>> >> If the lat lon can be converted to a 1d label as Wes suggested, then
>> >> in a similar timing exercise ndimage was the fastest.
>> >> http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
>> >
>> > And as you said your lats and lons are integers, you could simply do
>> >
>> > ll = lat*1000 + lon
>> >
>> > to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat or
>> > lon will never exceed 360 (degrees).
>> >
>> > After that, either use the ndimage approach, or you could use
>> > histogramming with weighting by data values and divide by histogram
>> > withouth weighting, or just loop.
>> >
>> > Vincent
>> >
>> >>
>> >> (this was for python 2.4, also later I found np.bincount which
>> >> requires that the labels are consecutive integers, but is as fast as
>> >> ndimage)
>> >>
>> >> I don't know how it would compare to the new suggestions.
>> >>
>> >> Josef
>> >>
>> >>
>> >>
>> >>>>
>> >>>> Zach
>> >>>> _______________________________________________
>> >>>> NumPy-Discussion mailing list
>> >>>> NumPy-Discussion at scipy.org
>> >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >>>>
>> >>>> _______________________________________________
>> >>>> NumPy-Discussion mailing list
>> >>>> NumPy-Discussion at scipy.org
>> >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >>>
>> >>> _______________________________________________
>> >>> NumPy-Discussion mailing list
>> >>> NumPy-Discussion at scipy.org
>> >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >>>
>> >
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion at scipy.org
>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>>
>> I was curious about how fast ndimage was for this operation so here's
>> the complete function.
>>
>> import scipy.ndimage as ndi
>>
>> N = 10000
>>
>> lat = np.random.randint(0, 360, N)
>> lon = np.random.randint(0, 360, N)
>> data = np.random.randn(N)
>>
>> def group_mean(lat, lon, data):
>>    indexer = np.lexsort((lon, lat))
>>    lat = lat.take(indexer)
>>    lon = lon.take(indexer)
>>    sorted_data = data.take(indexer)
>>
>>    keys = 1000 * lat + lon
>>    unique_keys = np.unique(keys)
>>
>>    result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
>>    decoder = keys.searchsorted(unique_keys)
>>
>>    return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))
>>
>> Appears to be about 13x faster (and could be made faster still) than
>> the naive version on my machine:
>>
>> def group_mean_naive(lat, lon, data):
>>    grouped = collections.defaultdict(list)
>>    for lt, ln, da in zip(lat, lon, data):
>>      grouped[(lt, ln)].append(da)
>>
>>    averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())
>>
>>    return averaged
>>
>> I had to get the latest scipy trunk to not get an error from ndimage.mean
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>

That's the error I was getting. Depending on your OS upgrading to the
scipy trunk should be the easiest fix.



More information about the NumPy-Discussion mailing list