[SciPy-Dev] splitting an ordered list as evenly as possilbe

Wed Aug 25 12:08:54 EDT 2010

On Wed, Aug 25, 2010 at 11:44 AM, Bruce Southey <bsouthey at gmail.com> wrote:
>  On 08/25/2010 09:44 AM, josef.pktd at gmail.com wrote:
>> On Wed, Aug 25, 2010 at 10:32 AM, Keith Goodman<kwgoodman at gmail.com>  wrote:
>>> On Wed, Aug 25, 2010 at 7:19 AM, John Hunter<jdh2358 at gmail.com>  wrote:
>>>> On Wed, Aug 25, 2010 at 9:10 AM, Keith Goodman<kwgoodman at gmail.com>  wrote:
>>>>
>>>>> How about using the percentiles of np.unique(x)? That takes care of
>>>>> the first constraint (no overlap) but ignores the second constraint
>>>>> (min std of cluster size).
>>>> Well, I need the 2nd constraint....
>>> Both can't be hard constraints, so I guess the first step is to define
>>> a utility function that quantifies the trade off between the two.
>>> Would it make sense to then start from the percentile(unique(x), ...)
>>> solution and come up with a heuristic that moves an item with lots of
>>> repeats in a large length quintile to a short lenght quintile and then
>>> accept the moves if it improves the utility? Or try moving each item
>>> to each of the other 4 quintiles and do the move the improves the
>>> utility the most. Then repeat until the utility doesn't improve. But I
>>> guess I'm just stating the obvious and you are looking for something
>>> less obvious and more clever.
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>
>> What I'm doing for some statistical analysis, e.g. chisquare test with
>> integer data (discrete random variable)?
>>
>> np.bincount to get the full count, or use theoretical pdf,
>> then loop over the integers (raw bins) and merge them to satisfy the
>> constraints.
>>
>> constraints that I'm using are equal binsizes in one version and
>> minimum binsizes in the second version.
>>
>> I haven't found anything else than the loop over the uniques, but I
>> think there was some discussion on this some time ago on a mailing
>> list.
>>
>> Josef
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
> As others have indicated you have to work with the unique values as well
> as the frequencies.
>
> Hopefully you can determine what I mean from the code below and modify
> it as needed. It is brute force but provides a couple of options as the
> following output indicates.
>
> 3 [(2, 44), (3, 35), (5, 42), (13, 43), (43, 38)]
> 4 [(2, 44), (3, 35), (5, 42), (14, 45), (43, 36)]
> 5 [(2, 44), (3, 35), (5, 42), (14, 45), (43, 36)]
> 6 [(2, 44), (3, 35), (5, 42), (15, 47), (43, 34)]
> 7 [(2, 44), (3, 35), (5, 42), (15, 47), (43, 34)]
> 8 [(2, 44), (3, 35), (5, 42), (16, 49), (43, 32)]
> 9 [(2, 44), (3, 35), (5, 42), (16, 49), (43, 32)]
>
>
> Some notes:
> 1) For this example, you need an average of 41 per group (202 elements
> divided by 5). But that will be impossible because the value '1' has a
> frequency of 44, the sum of frequencies of '2' and '3' is 61. This means
> we need some way to allow slight increases in sizes - I use the variable
> eval which is the expected count plus some threshold (berror).
>
> If you have floats then you can not use np.bincount directly. So if
> these are integers use them directly or use some function to create
> these in the desirable range (such as np.ceil or work with 10*x etc.)
>
> Bruce
>
> binx=np.bincount(x.astype(int))
> for berror in range(10): # loop over a range of possible variations in
> the counts
>     eval=berror+np.ceil(binx.sum()/5.0) # find a count threshold
>     count=0
>     quintile=[]
>     for i in range(binx.shape[0]): #loop over the frequencies to
> determine which bin
>         if count+binx[i] > (eval): # If the bin overflows then start a
> new one
>             quintile.append((i, count))
>             count=binx[i]
>         else: #other keep adding into current bin
>             count +=binx[i]
>     quintile.append((i, count)) #add the last bin
>     if len(quintile)==5: # we must have five bins otherwise that loop
> is useless. You can also apply other criteria here as well.
>         print berror, quintile

I don't think you can assume anything about the minimum number of bins
in general. For example, my similar code needed to work also for
binary distributions with at most two unique values and bins. A
degenerate distribution would have only a single value and bin.

eval is a built-in function

Josef

>
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>