finding most common elements between thousands of multiple arrays.

Scott David Daniels Scott.Daniels at Acm.Org
Sat Jul 4 18:50:23 EDT 2009


mclovin wrote:
> On Jul 4, 12:51 pm, Scott David Daniels <Scott.Dani... at Acm.Org> wrote:
>> mclovin wrote:
>>> OK then. I will try some of the strategies here but I guess things
>>> aren't looking too good. I need to run this over a dataset that someone
>>> pickled. I need to run this 480,000 times, so you can see my
>>> frustration. So it doesn't need to be "real time", but it would be nice
>>> if it was done sorting this month.
>>> Is there a "best guess" strategy where it is not 100% accurate but much
>>> faster?
>>
>> Well, I timed a run of a version of mine, and the scan is approx 5X
>> longer than the copy-and-sort.  Time arr_of_arr.flatten().sort() to
>> see how quickly the copy and sort happens.  So you could try a variant
>> exploiting the following property:
>>      If you know the minimum length of a run that will be in the top 25,
>> then the value for each of the most-frequent run entries must show up at
>> positions n * stride and (n + 1) * stride (for some n).  That should
>> drastically reduce the scan cost, as long as stride is reasonably large....
>>
>> Suppose sum(flattened[:-stride:stride] == flattened[stride::stride]) == 1000.
>> Then there are only 1000 points to investigate.
>> With any distribution other than uniform, that should go _way_ down.
>> So just pull out those points, use bisect to get their frequencies, and
>> feed those results into the heap accumulation.
>>
>> --Scott David Daniels
> 
> I don't quite understand what you are saying, but I know this: the number
> of times the most common element appears varies greatly. Sometimes it
> appears over 1000 times, and sometimes fewer than 50. It all depends on
> the density of the arrays I am analyzing.
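
To make the sampling idea concrete before the full code, here is a
minimal sketch (toy data, assuming numpy; the array and stride are
made up purely for illustration):

     import numpy as np

     data = np.sort(np.array([1, 2, 3, 8, 9] + [5] * 7))  # toy sorted data
     stride = 3
     # Pair entries that are stride apart; any run of length
     # >= 2 * stride must yield at least one equal pair.
     hits = data[:-stride:stride] == data[stride::stride]
     print(data[:-stride:stride][hits])  # candidates: [5 5] (duplicates OK)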

Here's a heuristic replacement for my previous frequency code:
I've tried to mark where you could fudge numbers if the run time
is at all close.

import bisect

def frequency(arr_of_arr, N, stride=100):
     '''produce (freq, value) pairs for data in arr_of_arr.

     Tries to produce > N pairs.  stride is a guess at half
     the length of the shortest run in the top N runs.
     '''
     # if the next two lines are too slow, this whole approach is toast
     data = arr_of_arr.flatten()  # big allocation
     data.sort() # a couple of seconds for 25 million ints

     # stride is a length forcing examination of a run.
     sampled = data[::stride]
     # Note this is a view into data, and is still sorted.
     # We know that any run of length 2 * stride in data _must_ include
     # two consecutive entries in sampled.  Compare them "in parallel".
     matches = sampled[:-1] == sampled[1:]
     # matches is a boolean array: True where adjacent sampled values agree
     candidates = sum(matches) # count identified matches

     # while candidates is huge, keep trying with a larger stride
     while candidates > N * 10: # 10 -- heuristic
         stride *= 2 # heuristic increase
         sampled = data[::stride]
         matches = sampled[:-1] == sampled[1:]
         candidates = sum(matches)

     # if we got too few, move stride down:
     while candidates < N * 3 and stride > 1: # heuristic slop for long runs
         stride //= 2 # heuristic decrease (stop at 1; data[::0] would fail)
         sampled = data[::stride]
         matches = sampled[:-1] == sampled[1:]
         candidates = sum(matches)

     # Here we have a "nice" list of candidates that is likely
     # to include every run we actually want. sampled[matches] is
     # the sorted list of candidate values.  It may have duplicates.
     former = None
     past = 0
     # In the loop here we only use sampled to pick the values we
     # then go find in data.  We avoid checking the same value twice.
     for value in sampled[matches]:
         if value == former:
             continue # A long run: multiple matches in sampled
         former = value # Make sure we only try this one once
         # find the beginning of the run
         start = bisect.bisect_left(data, value, past)
         # find the end of the run (we know it is at least stride long)
         past = bisect.bisect_right(data, value, start + stride)
         yield past - start, value # produce frequency, value data
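
To turn those pairs into the top N, heapq can do the accumulation
(a sketch; heapq.nlargest compares the (freq, value) tuples, so ties
on frequency break on value):

     import heapq

     top25 = heapq.nlargest(25, frequency(arr_of_arr, 25))
     for freq, value in top25:
         print('%s appears %s times' % (value, freq))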

--Scott David Daniels
Scott.Daniels at Acm.Org


