[SciPy-Dev] splitting an ordered list as evenly as possilbe

John Hunter jdh2358 at gmail.com
Wed Aug 25 10:00:00 EDT 2010


Suppose I have an ordered list/array of numbers, and I want to split
them into N chunks, such that the intersection of any chunk with each
other is empty and the data is split as evenly as possible (eg the std
dev of the lengths of the chunks is minimized or some other such
criterion).  Context: I am trying to do a quintile analysis on some
data, and np.percentile doesn't behave like I want because more than
20% of my data equals 1, so 1  is in the first and second quintiles.
I want to avoid this -- I'd rather have uneven counts in my quintiles
than have the same value show up in multiple quintiles, but I'd like
the counts to be as even as possible..

Here is some sample code that illustrates my problem:

In [178]: run ~/test

tile i=1 range=[1.00, 1.00), count=0
tile i=2 range=[1.00, 3.00), count=79
tile i=3 range=[3.00, 4.60), count=42
tile i=4 range=[4.60, 11.00), count=39
tile i=5 range=[11.00, 43.00), count=41


import numpy as np

x = np.array([  2.,   3.,   4.,   5.,   1.,   2.,   1.,   1.,   1.,   2.,   3.,
         1.,   2.,   3.,   1.,   2.,   3.,   1.,   2.,   3.,   4.,   1.,
         1.,   2.,   3.,   2.,   2.,   3.,   4.,   5.,   1.,   2.,   3.,
         4.,   5.,   6.,   7.,   1.,   1.,   2.,   3.,   4.,   5.,   6.,
         7.,   1.,   2.,   3.,   1.,   2.,   1.,   2.,   3.,   1.,   2.,
         4.,   1.,   2.,   1.,   2.,   3.,   4.,   5.,   6.,   1.,   2.,
         3.,   1.,   1.,   1.,   1.,   1.,   1.,   2.,   1.,   2.,   3.,
         1.,   2.,   3.,   1.,   1.,   1.,   2.,   3.,   4.,   5.,   6.,
         7.,   8.,   9.,  10.,   1.,   1.,   2.,   3.,   1.,   2.,   3.,
         4.,   5.,   6.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,
         9.,  10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,
        20.,  21.,  22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,
        31.,  32.,  33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,
        42.,  43.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   1.,   2.,
         3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.,
        14.,  15.,  16.,  17.,  18.,  19.,   1.,   2.,   1.,   2.,   3.,
         4.,   5.,   6.,   1.,   2.,   3.,   4.,   5.,   6.,   1.,   2.,
         3.,   4.,   1.,   2.,   3.,   4.,   5.,   6.,   1.,   2.,   3.,
         1.,   2.,   1.,   2.])


tiles = np.percentile(x, (0, 20, 40, 60, 80, 100))

print
for i in range(1, len(tiles)):
    xmin, xmax = tiles[i-1], tiles[i]
    print 'tile i=%d range=[%.2f, %.2f), count=%d'%(i, xmin, xmax,
((x>=xmin) & (x<xmax)).sum())



More information about the SciPy-Dev mailing list