counting unique numpy subarrays

Fri Dec 4 19:18:04 EST 2015

On 04/12/15 23:06, Peter Otten wrote:
> duncan smith wrote:
> 
>> Hello,
>>       I'm trying to find a computationally efficient way of identifying
>> unique subarrays, counting them and returning an array containing only
>> the unique subarrays and a corresponding 1D array of counts. The
>> following code works, but is a bit slow.
>>
>> ###############
>>
>> from collections import Counter
>> import numpy
>>
>> def bag_data(data):
>>     # data (a numpy array) is bagged along axis 0
>>     # returns an array of the unique subarrays and a corresponding array of counts
>>     vec_shape = data.shape[1:]
>>     counts = Counter(tuple(arr.flatten()) for arr in data)
>>     data_out = numpy.zeros((len(counts),) + vec_shape)
>>     cnts = numpy.zeros((len(counts),))
>>     for i, (tup, cnt) in enumerate(counts.items()):
>>         data_out[i] = numpy.array(tup).reshape(vec_shape)
>>         cnts[i] = cnt
>>     return data_out, cnts
>>
>> ###############
>>
>> I've been looking through the numpy docs, but don't seem to be able to
>> come up with a clean solution that avoids Python loops. 
> 
> Me neither :(
> 
>> TIA for any
>> useful pointers. Cheers.
> 
> Here's what I have so far:
> 
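> def bag_data(data):
>     # counts[i] holds the count for the subarray first seen at index i
>     counts = numpy.zeros(data.shape[0])
>     seen = {}  # maps a subarray's raw bytes to the index of its first occurrence
>     for i, arr in enumerate(data):
>         sarr = arr.tobytes()  # hashable key; tostring() is a deprecated alias
>         if sarr in seen:
>             counts[seen[sarr]] += 1
>         else:
>             seen[sarr] = i
>             counts[i] = 1
>     # keep only the first occurrence of each distinct subarray
>     nz = counts != 0
>     return numpy.compress(nz, data, axis=0), numpy.compress(nz, counts)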
> 

Three times as fast as what I had, and a bit cleaner. Excellent. Cheers.

Duncan
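
For the original question of avoiding the Python loop entirely, here is a
minimal sketch that hands the whole job to numpy.unique. It assumes a NumPy
recent enough to have the axis argument (1.13 or later, so newer than this
thread), and the name bag_data_unique is made up for illustration. Note two
differences from the versions above: the unique subarrays come back in sorted
order rather than first-seen order, and the counts are integers.

```python
import numpy

def bag_data_unique(data):
    # Hypothetical helper, not from the thread: bag data along axis 0.
    data = numpy.asarray(data)
    # Flatten each subarray into one row so rows can be compared directly.
    flat = data.reshape(data.shape[0], -1)
    # axis=0 treats each row as a single element; requires NumPy >= 1.13.
    uniq, counts = numpy.unique(flat, axis=0, return_counts=True)
    # Restore the original subarray shape for the unique rows.
    return uniq.reshape((-1,) + data.shape[1:]), counts

data = numpy.array([[[1, 2], [3, 4]],
                    [[5, 6], [7, 8]],
                    [[1, 2], [3, 4]]])
uniq, counts = bag_data_unique(data)
```

Whether this beats the dictionary-based loop probably depends on the shape of
the data, so it is worth timing both on a realistic input.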