[Numpy-discussion] help in improving data analysis code

Fri Nov 25 06:25:03 EST 2005

Hi,
I am a newbie to Numeric/numarray programming and would appreciate
your help in improving the code below (which I'm sure is quite ugly to
an experienced numarray programmer).
An analysis we are carrying out requires the following:
1. evaluate the mean of a set of data
2. eliminate the data point farthest from the mean
3. repeat steps 1 and 2 until a specified fraction of the points
has been eliminated.
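
For reference, the three steps above can be sketched directly in present-day NumPy. This is only an illustration under stated assumptions: NumPy stands in for numarray, the data are treated as a flat 1-D array, and the function name trim_farthest_from_mean is hypothetical. It uses argmax to find the single farthest point instead of sorting on every pass.

```python
import numpy as np

def trim_farthest_from_mean(data, frac):
    """Repeatedly drop the point farthest from the current mean.

    Hypothetical helper: removes floor(len(data) * frac) points.
    """
    # Assumption: the data set is one-dimensional (flattened if needed).
    data = np.asarray(data, dtype=float).ravel()
    num_to_eliminate = int(np.floor(data.size * frac))
    for _ in range(num_to_eliminate):
        # Step 1: mean of the points that remain.
        center = data.mean()
        # Step 2: index of the point farthest from that mean
        # (on a tie, argmax picks the first such point).
        worst = np.argmax(np.abs(data - center))
        data = np.delete(data, worst)
    # Step 3 is the loop itself; survivors keep their input order.
    return data
```

Called as trim_farthest_from_mean(values, 0.10), this would drop the 10% of points that are, one at a time, farthest from the running mean.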

Since this analysis will have to be performed (probably repeatedly) on
approximately ten thousand data sets, each of which contains 100-500
points, I would like the code to be as fast as possible.
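
One possible way to speed this up, offered as a sketch and not a tested recommendation: for one-dimensional data, the point farthest from the mean is always the current minimum or the current maximum. The data can therefore be sorted once and trimmed from either end while a running sum keeps the mean up to date. The version below assumes NumPy in place of numarray, 1-D data, and 0 <= frac <= 1; the name trim_sorted is hypothetical.

```python
import numpy as np

def trim_sorted(data, frac):
    """Sort once, then peel points off the ends (hypothetical helper)."""
    # Assumption: one-dimensional data and 0 <= frac <= 1.
    data = np.sort(np.asarray(data, dtype=float).ravel())
    num_to_eliminate = int(np.floor(data.size * frac))
    lo, hi = 0, data.size - 1      # inclusive bounds of the survivors
    total = data.sum()             # running sum of the survivors
    for _ in range(num_to_eliminate):
        center = total / (hi - lo + 1)
        # Only the smallest or the largest survivor can be the farthest.
        if center - data[lo] > data[hi] - center:
            total -= data[lo]
            lo += 1
        else:
            # Ties go to the upper end.
            total -= data[hi]
            hi -= 1
    return data[lo:hi + 1]         # survivors, in ascending order
```

After the initial sort, each elimination costs a constant amount of work, so a 500-point set needs one sort plus a short loop. Two caveats: the result comes back sorted instead of in input order, and the running sum accumulates a little floating-point rounding that a fresh mean on every pass would not.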

Thanks for your help.

-g

====

from numarray import add, array, asarray, absolute, argsort, floor, take, size

def mean(m,axis=0):
    m = asarray(m)
    return add.reduce(m,axis)/float(m.shape[axis])

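def eliminate_outliers(dat,frac):
    # dat is expected as an (n,1) column array, as in the test below;
    # frac is the fraction of points to eliminate
    num_to_eliminate = int(floor(size(dat,0)*frac))
    for i in range(num_to_eliminate):
        # sort the points by distance from the current mean (ascending)
        ind = argsort(absolute(dat-mean(dat)),0)
        sdat = take(dat,ind,0)[:,0]
        # drop the last point, i.e. the one farthest from the mean
        dat = sdat[:-1]
    return dat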

#--------------------------------------------------------------------

if __name__ == "__main__":
    from MLab import rand
    sz = 100
    nn = rand(sz,1)
    nn[:10] = 20*rand(10,1)
    nn[sz-10:] = -20*rand(10,1)
    print eliminate_outliers(nn,0.10)