[Numpy-discussion] Using matplotlib's prctile on masked arrays

Tue Oct 27 09:25:21 EDT 2009

On Tue, Oct 27, 2009 at 7:56 AM, Gökhan Sever <gokhansever at gmail.com> wrote:
> Hello,
>
> Consider these two sample columns of data:
>
>  999999.9999  999999.9999
>  999999.9999  999999.9999
>  999999.9999  999999.9999
>  999999.9999    1693.9069
>  999999.9999    1676.1059
>  999999.9999    1621.5875
>     651.8040    1542.1373
>     691.0138    1650.4214
>     678.5558    1710.7311
>     621.5777  999999.9999
>     644.8341  999999.9999
>     696.2080  999999.9999
>
> Putting this data into a file, say "sample.data", and loading it with:
>
> a,b = np.loadtxt('sample.data', dtype="float").T
>
> I[16]: a
> O[16]:
> array([  1.00000000e+06,   1.00000000e+06,   1.00000000e+06,
>          1.00000000e+06,   1.00000000e+06,   1.00000000e+06,
>          6.51804000e+02,   6.91013800e+02,   6.78555800e+02,
>          6.21577700e+02,   6.44834100e+02,   6.96208000e+02])
>
> I[17]: b
> O[17]:
> array([ 999999.9999,  999999.9999,  999999.9999,    1693.9069,
>           1676.1059,    1621.5875,    1542.1373,    1650.4214,
>           1710.7311,  999999.9999,  999999.9999,  999999.9999])
>
> ### Interestingly, the second column is loaded as is, but the values of a
> are reformatted a little. Why could this be happening? Any idea? Anyway,
> back to masked arrays:
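
As far as I can tell, nothing happens to the values on loading; only the
printed form differs. numpy picks one format per array and, by default,
switches to scientific notation when the ratio of the largest to the smallest
absolute value exceeds 1e3. That is the case for a (999999.9999 / 621.5777 is
about 1609) but not for b (999999.9999 / 1542.1373 is about 648). A minimal
sketch, using a three-row subset of the data and an in-memory buffer instead
of "sample.data" (both are my assumptions, chosen only to keep it short):

```python
import io
import numpy as np

# Three rows taken from the sample data above (a subset, for brevity).
text = """\
999999.9999 1693.9069
651.8040 1542.1373
621.5777 999999.9999
"""
a, b = np.loadtxt(io.StringIO(text), dtype=float).T

# The stored values are the same floats that the literals parse to.
print(a[1] == 651.8040, a[0] == 999999.9999)

# Default display: the format depends on the value range of each array.
print(repr(a))
print(repr(b))

# Ask for fixed-point output to see a without the exponent notation.
with np.printoptions(suppress=True, precision=4):
    print(a)
```

So the masking step further down should see exactly the same 999999.9999 in
both columns.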
>
> I[24]: am = ma.masked_values(a, value=999999.9999)
>
> I[25]: am
> O[25]:
> masked_array(data = [-- -- -- -- -- -- 651.804 691.0138 678.5558 621.5777
> 644.8341 696.208],
>              mask = [ True  True  True  True  True  True False False False
> False False False],
>        fill_value = 999999.9999)
>
>
> I[30]: bm = ma.masked_values(b, value=999999.9999)
>
> I[31]: am
> O[31]:
> masked_array(data = [-- -- -- -- -- -- 651.804 691.0138 678.5558 621.5777
> 644.8341 696.208],
>              mask = [ True  True  True  True  True  True False False False
> False False False],
>        fill_value = 999999.9999)
>
>
> So far so good. A few basic checks:
>
> I[33]: am/bm
> O[33]:
> masked_array(data = [-- -- -- -- -- -- 0.422662755126 0.418689311712
> 0.39664667346 -- -- --],
>              mask = [ True  True  True  True  True  True False False False
> True  True  True],
>        fill_value = 999999.9999)
>
>
> I[34]: mean(am/bm)
> O[34]: 0.41266624676580849
>
> Unfortunately, matplotlib.mlab's prctile cannot handle this division:
>
> I[54]: prctile(am/bm, p=[5,25,50,75,95])
> O[54]:
> array([  3.96646673e-01,   6.21577700e+02,   1.00000000e+06,
>          1.00000000e+06,   1.00000000e+06])
>
>
> This also results in wrong-looking box-and-whisker plots.
>
>
> Testing further with scipy.stats functions yields the expected, correct results:

These should not be the correct results if you use scipy.stats.scoreatpercentile:
it has no proper missing-value handling and treats nans or masked/fill values
as regular numbers, sorted to the end.

stats.mstats.scoreatpercentile  is the corresponding function for
masked arrays.
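
If you only need the numbers, another option is to drop the masked entries
first and then use a plain percentile function. This is just a sketch of that
idea, not the mstats function named above: it rebuilds the two columns by
hand, and compressed() plus np.percentile are my choice for the illustration.
The interpolation rule of np.percentile may differ slightly from the one
mstats uses by default, so the two need not agree to the last digit.

```python
import numpy as np
import numpy.ma as ma

# The two columns from the original message; 999999.9999 marks missing data.
a = np.array([999999.9999] * 6 +
             [651.8040, 691.0138, 678.5558, 621.5777, 644.8341, 696.2080])
b = np.array([999999.9999] * 3 +
             [1693.9069, 1676.1059, 1621.5875,
              1542.1373, 1650.4214, 1710.7311] +
             [999999.9999] * 3)

am = ma.masked_values(a, value=999999.9999)
bm = ma.masked_values(b, value=999999.9999)

# An element of the ratio is masked wherever either operand is masked.
ratio = am / bm

# compressed() returns only the unmasked values as a plain 1-d ndarray,
# so the fill values can no longer leak into the percentile computation.
valid = ratio.compressed()
percentiles = np.percentile(valid, [5, 25, 50, 75, 95])

print(valid)
print(percentiles)
```

With only three valid ratios, every percentile has to lie between the
smallest and the largest of them, which is a quick sanity check on whatever
function you end up using.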

(BTW, I wasn't able to quickly copy and paste your example because
MaskedArrays don't seem to have a constructive __repr__, i.e.
no commas.)

I don't know anything about the matplotlib story.

Josef

>
> I[55]: stats.scoreatpercentile(am/bm, per=5)
> O[55]: 0.40877012449846228
>
> I[49]: stats.scoreatpercentile(am/bm, per=25)
> O[49]:
> masked_array(data = --,
>              mask = True,
>        fill_value = 1e+20)
>
> I[56]: stats.scoreatpercentile(am/bm, per=95)
> O[56]:
> masked_array(data = --,
>              mask = True,
>        fill_value = 1e+20)
>
>
> Any confirmation?
>
> --
> Gökhan