[Numpy-discussion] numpy.percentile multiple arrays

questions anon questions.anon at gmail.com
Tue Jan 24 22:49:46 EST 2012


Thanks for your responses.
Because of the size of the dataset I will still end up with the memory
error if I calculate the median for each file; additionally, the files are
not all the same size. I believe the memory problem will also arise with
the cumulative distribution calculation, and I am not sure I understand how
to write the second suggestion about the iterative approach, but I will
have a go.
Thanks again

On Wed, Jan 25, 2012 at 1:26 PM, Brett Olsen <brett.olsen at gmail.com> wrote:

> On Tue, Jan 24, 2012 at 6:22 PM, questions anon
> <questions.anon at gmail.com> wrote:
> > I need some help understanding how to loop through many arrays to
> > calculate the 95th percentile.
> > I can easily do this by using numpy.concatenate to make one big array
> > and then finding the 95th percentile using numpy.percentile, but this
> > causes a memory error when I want to run it on hundreds of netCDF
> > files (see code below).
> > Any alternative methods will be greatly appreciated.
> >
> >
> > import os
> > import numpy as N
> > from netCDF4 import Dataset
> >
> > all_TSFC = []
> > for (path, dirs, files) in os.walk(MainFolder):
> >     for ncfile in files:
> >         if ncfile.endswith('.nc'):
> >             print "dealing with ncfile:", ncfile
> >             # os.walk already supplies the directory, so join handles the path
> >             ncfile = os.path.join(path, ncfile)
> >             nc = Dataset(ncfile, 'r')   # read-only; 'r+' is not needed
> >             TSFC = nc.variables['T_SFC'][:]
> >             nc.close()
> >             all_TSFC.append(TSFC)
> >
> > big_array = N.ma.concatenate(all_TSFC)   # holds every file's data at once
> > Percentile95th = N.percentile(big_array, 95, axis=0)
>
> If the range of your data is known and limited (i.e., you have a
> comparatively small number of possible values, each repeated many
> times), then you could do this by keeping a running cumulative
> distribution function as you go through your files.  For each file,
> calculate a cumulative distribution function --- at each possible
> value, record the fraction of the population strictly less than that
> value --- and then it is straightforward to combine the cumulative
> distribution functions from two separate files:
> cumdist_both = (cumdist1 * N1 + cumdist2 * N2) / (N1 + N2)
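>
> In code, the running tally might look something like this untested
> sketch.  The bin edges (here a -50 to 60 degree range at 0.1-degree
> resolution) and the 'filepaths' list are placeholders for your setup;
> note that accumulating raw counts and normalising once at the end is
> equivalent to the weighted combination above:
>
> import numpy as N
> from netCDF4 import Dataset
>
> bins = N.arange(-50.0, 60.0, 0.1)     # assumed value range; adjust to your data
> counts = N.zeros(len(bins) - 1, dtype=N.int64)
>
> for fname in filepaths:
>     nc = Dataset(fname, 'r')
>     data = nc.variables['T_SFC'][:]
>     nc.close()
>     # histogram one file at a time, dropping masked points
>     c, _ = N.histogram(N.ma.compressed(data), bins=bins)
>     counts += c                       # only one file in memory at once
>
> # combined cumulative distribution over all files
> cumdist = N.cumsum(counts) / float(counts.sum())
> # 95th percentile: first bin edge where the CDF reaches 0.95
> p95 = bins[1:][N.searchsorted(cumdist, 0.95)]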
>
> Then once you have gone through all the files, look for the value where
> your cumulative distribution function equals 0.95.  If your data isn't
> structured with repeated values, though, this won't work, because the
> cumulative distribution function will become too big to hold in memory.
> In that case, I would take an iterative approach: build a coarse
> approximation of the distribution over a limited set of possible
> values, use it to bracket the percentile you want, then walk through
> the files again computing the distribution more finely within that
> bracket, repeating until you have the value to the desired precision.
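>
> An untested sketch of that iteration (the initial range, the bin count,
> and 'filepaths' are assumptions; the starting range must cover all the
> data):
>
> import numpy as N
> from netCDF4 import Dataset
>
> def read_file(fname):
>     nc = Dataset(fname, 'r')
>     data = N.ma.compressed(nc.variables['T_SFC'][:])
>     nc.close()
>     return data
>
> lo, hi = -50.0, 60.0            # assumed overall data range
> target = 0.95
> nbins = 1000
>
> while hi - lo > 1e-3:           # repeat until the bracket is tight enough
>     edges = N.linspace(lo, hi, nbins + 1)
>     counts = N.zeros(nbins, dtype=N.int64)
>     below = 0                   # points falling below the current bracket
>     total = 0
>     for fname in filepaths:
>         data = read_file(fname)
>         total += data.size
>         below += (data < lo).sum()
>         c, _ = N.histogram(data, bins=edges)
>         counts += c
>     # narrow the bracket to the single bin containing the 95th percentile
>     i = N.searchsorted(below + N.cumsum(counts), target * total)
>     lo, hi = edges[i], edges[i + 1]
>
> p95 = 0.5 * (lo + hi)           # percentile bracketed to within the tolerance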
>
> ~Brett
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>