Numpy outlier removal

Paul Simon psimon at sonic.net
Sun Jan 6 21:21:06 EST 2013


"Steven D'Aprano" <steve+comp.lang.python at pearwood.info> wrote in message 
news:50ea28e7$0$30003$c3e8da3$5496439d at news.astraweb.com...
> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>
>> I have a dataset that consists of a dict with text descriptions and
>> values that are integers. If required, I collect the values into a list
>> and create a numpy array, running it through a simple routine:
>>
>> data[abs(data - mean(data)) < m * std(data)]
>>
>> where m is the number of std deviations to include.
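
(For concreteness, a minimal runnable version of the quoted one-liner,
assuming the values are collected into a 1-D NumPy array and m is the
cutoff in standard deviations; the function name and the sample numbers
are made up for illustration.)

    import numpy as np

    def clip_outliers(values, m):
        # Keep only the points within m standard deviations of the mean,
        # i.e. exactly the quoted filter, wrapped in a function.
        data = np.asarray(values, dtype=float)
        return data[np.abs(data - np.mean(data)) < m * np.std(data)]

    print(clip_outliers([12, 15, 14, 10, 550, 13, 16], m=2))
    # the 550 is dropped, the rest are kept
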
>
> I'm not sure that this approach is statistically robust. No, let me be
> even more assertive: I'm sure that this approach is NOT statistically
> robust, and may be scientifically dubious.
>
> The above assumes your data is normally distributed. How sure are you
> that this is actually the case?
>
> For normally distributed data:
>
> Since both the mean and std calculations are affected by the presence of
> outliers, your test for what counts as an outlier can itself be distorted
> and miss real outliers, even for data from a normal distribution. For
> small N (sample size), it may be mathematically impossible for any data
> point to be more than m*SD from the mean. For example, with N=5, no data
> point can be more than 1.789*SD from the mean. So for N=5, m=1 may throw
> away good data, and m=2 will fail to find any outliers no matter how
> outrageous they are.
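
(That 1.789 is (N-1)/sqrt(N) for N=5; by Samuelson's inequality no point
in a sample can lie more than that many sample standard deviations
(ddof=1) from the sample mean. A quick numerical check with the most
lopsided five-point sample possible:)

    import numpy as np

    # Worst case for N = 5: four identical values and one stray point.
    data = np.array([0., 0., 0., 0., 1.])
    z = np.abs(data - data.mean()) / data.std(ddof=1)   # sample std
    print(z.max())   # ~1.7889, i.e. (5 - 1) / sqrt(5)

(Note that np.std defaults to the population formula, ddof=0; under that
convention the corresponding bound is sqrt(N-1), i.e. 2.0 for N=5.)
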
>
> For large N, you should expect to find a significant number of data points
> more than m*SD from the mean. With N=100000 and m=3, you can expect to
> throw away about 270 perfectly good data points simply because they are out
> on the tails of the distribution.
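
(The 270 is just the two-sided normal tail: P(|Z| > 3) is about 0.0027,
and 0.0027 * 100000 = 270. With SciPy, an extra dependency used here only
for the tail probability, the expected count is:)

    from scipy.stats import norm

    n, m = 100000, 3
    expected = 2 * norm.sf(m) * n   # two-sided tail probability times N
    print(expected)                 # about 270 points beyond 3 standard deviations
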
>
> Worse, if the data is not in fact from a normal distribution, all bets
> are off. You may be keeping obvious outliers; or more often, your test
> will be throwing away perfectly good data that it misidentifies as
> outliers.
>
> In other words: this approach for detecting outliers is nothing more than
> a very rough, and very bad, heuristic, and should be avoided.
>
> Identifying outliers is fraught with problems even for experts. For
> example, the ozone hole over the Antarctic was ignored for many years
> because the software being used to analyse it misidentified the data as
> outliers.
>
> The best general advice I have seen is:
>
> Never automatically remove outliers except for values that are physically
> impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"),
> unless you have good, solid, physical reasons for justifying removal of
> outliers. Other than that, manually remove outliers with care, or not at
> all, and if you do so, always report your results twice, once with all
> the data, and once with supposed outliers removed.
>
> You can read up more about outlier detection, and the difficulties
> thereof, here:
>
> http://www.medcalc.org/manual/outliers.php
>
> https://secure.graphpad.com/guides/prism/6/statistics/index.htm
>
> http://www.webapps.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html
>
> http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations
>
>
>
> -- 
> Steven

If you suspect that the data may not be normal, you might look at exploratory
data analysis (EDA); see Tukey. It is descriptive rather than analytic, treats
outliers respectfully, uses the median rather than the mean, and is very
visual. Whenever I have analyzed data both with Gaussian methods and with EDA,
EDA has always won.
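
For illustration, a sketch of one rule in that spirit, Tukey's boxplot
fences, which rest on the quartiles rather than the mean and standard
deviation (the function name and the conventional k=1.5 are just choices
made here for the example):

    import numpy as np

    def tukey_keep(values, k=1.5):
        # Keep points inside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR].
        # The quartiles, unlike the mean and std, barely move when a few
        # wild values are present.
        data = np.asarray(values, dtype=float)
        q1, q3 = np.percentile(data, [25, 75])
        iqr = q3 - q1
        keep = (data >= q1 - k * iqr) & (data <= q3 + k * iqr)
        return data[keep]

    print(tukey_keep([12, 15, 14, 10, 550, 13, 16]))   # the 550 is dropped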

Paul 




