[SciPy-Dev] splitting an ordered list as evenly as possible

Benjamin Root ben.root at ou.edu
Wed Aug 25 11:24:06 EDT 2010


On Wed, Aug 25, 2010 at 9:00 AM, John Hunter <jdh2358 at gmail.com> wrote:

> Suppose I have an ordered list/array of numbers, and I want to split
> them into N chunks, such that the chunks are pairwise disjoint and the
> data is split as evenly as possible (e.g. the std dev of the chunk
> lengths is minimized, or some other such criterion).  Context: I am
> trying to do a quintile analysis on some data, and np.percentile
> doesn't behave like I want because more than 20% of my data equals 1,
> so 1 is in the first and second quintiles.  I want to avoid this --
> I'd rather have uneven counts in my quintiles than have the same value
> show up in multiple quintiles, but I'd like the counts to be as even
> as possible.
>
> Here is some sample code that illustrates my problem:
>
> In [178]: run ~/test
>
> tile i=1 range=[1.00, 1.00), count=0
> tile i=2 range=[1.00, 3.00), count=79
> tile i=3 range=[3.00, 4.60), count=42
> tile i=4 range=[4.60, 11.00), count=39
> tile i=5 range=[11.00, 43.00), count=41
>
>
> import numpy as np
>
> x = np.array([  2.,   3.,   4.,   5.,   1.,   2.,   1.,   1.,   1.,   2.,   3.,
>         1.,   2.,   3.,   1.,   2.,   3.,   1.,   2.,   3.,   4.,   1.,
>         1.,   2.,   3.,   2.,   2.,   3.,   4.,   5.,   1.,   2.,   3.,
>         4.,   5.,   6.,   7.,   1.,   1.,   2.,   3.,   4.,   5.,   6.,
>         7.,   1.,   2.,   3.,   1.,   2.,   1.,   2.,   3.,   1.,   2.,
>         4.,   1.,   2.,   1.,   2.,   3.,   4.,   5.,   6.,   1.,   2.,
>         3.,   1.,   1.,   1.,   1.,   1.,   1.,   2.,   1.,   2.,   3.,
>         1.,   2.,   3.,   1.,   1.,   1.,   2.,   3.,   4.,   5.,   6.,
>         7.,   8.,   9.,  10.,   1.,   1.,   2.,   3.,   1.,   2.,   3.,
>         4.,   5.,   6.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,
>         9.,  10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,
>        20.,  21.,  22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,
>        31.,  32.,  33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,
>        42.,  43.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   1.,   2.,
>         3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.,
>        14.,  15.,  16.,  17.,  18.,  19.,   1.,   2.,   1.,   2.,   3.,
>         4.,   5.,   6.,   1.,   2.,   3.,   4.,   5.,   6.,   1.,   2.,
>         3.,   4.,   1.,   2.,   3.,   4.,   5.,   6.,   1.,   2.,   3.,
>         1.,   2.,   1.,   2.])
>
>
> tiles = np.percentile(x, (0, 20, 40, 60, 80, 100))
>
> print
> for i in range(1, len(tiles)):
>    xmin, xmax = tiles[i-1], tiles[i]
>    print 'tile i=%d range=[%.2f, %.2f), count=%d' % (
>        i, xmin, xmax, ((x >= xmin) & (x < xmax)).sum())
>
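
One direct way to get what you ask for is to cut only between distinct
values, aiming each cut at the ideal one-fifth share of the cumulative
counts.  A rough sketch of that idea -- the helper name even_value_edges
is made up here, np.unique(..., return_counts=True) needs NumPy >= 1.9,
and x is the array from your snippet:

import numpy as np

def even_value_edges(x, n):
    # Bin edges for n chunks where no value is split across chunks:
    # cut only between distinct values, at the first value whose
    # cumulative count reaches the ideal share i/n of the total.
    vals, counts = np.unique(x, return_counts=True)
    cum = np.cumsum(counts)
    targets = cum[-1] * np.arange(1, n) / float(n)
    idx = np.searchsorted(cum, targets)   # first value reaching each share
    # note: if one value spans several shares, adjacent edges coincide
    # and that chunk comes out empty rather than repeating the value
    inner = vals[np.minimum(idx + 1, len(vals) - 1)]
    return np.concatenate(([vals[0]], inner, [vals[-1] + 1]))

edges = even_value_edges(x, 5)
for i in range(1, len(edges)):
    lo, hi = edges[i - 1], edges[i]
    print('tile i=%d range=[%.2f, %.2f), count=%d'
          % (i, lo, hi, ((x >= lo) & (x < hi)).sum()))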

Just a crazy thought, but maybe kmeans clustering might be what you are
looking for?  If you know the number of bins ahead of time, you can let
kmeans group the values automatically.  The ones will all fall into the
same cluster (and so will any other duplicated values).  If you sort the
data first, the behavior should be consistent.
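
For example, a rough sketch with scipy.cluster.vq, reusing the array x
from your snippet (5 clusters to mimic quintiles; the reshape just makes
each value a one-feature observation):

import numpy as np
from scipy.cluster.vq import kmeans, vq

xs = np.sort(x).reshape(-1, 1)   # one observation per row, one feature
centroids, _ = kmeans(xs, 5)     # may return fewer centers on heavily tied data
labels, _ = vq(xs, centroids)    # assign each value to its nearest center
for i, k in enumerate(np.argsort(centroids.ravel()), 1):
    members = xs[labels == k]
    print('cluster i=%d range=[%.2f, %.2f], count=%d'
          % (i, members.min(), members.max(), len(members)))

Since the data is one-dimensional, nearest-center assignment carves the
line into contiguous ranges, so no value can end up in two bins.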

I once used kmeans to "snap" height data from multiple observations onto
a common set of heights.  The observations had many zero-height values,
while the remaining values were rarely repeated.  The approach worked
great in that particular application.

Ben Root