Numpy outlier removal

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jan 7 00:11:13 EST 2013


On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:

> On 7 January 2013 01:46, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>>
>>> I have a dataset that consists of a dict with text descriptions and
>>> values that are integers. If required, I collect the values into a
>>> list and create a numpy array running it through a simple routine:
>>>
>>> data[abs(data - mean(data)) < m * std(data)]
>>>
>>> where m is the number of std deviations to include.
>>
>> I'm not sure that this approach is statistically robust. No, let me be
>> even more assertive: I'm sure that this approach is NOT statistically
>> robust, and may be scientifically dubious.
> 
> Whether or not this is "statistically robust" requires more explanation
> about the OP's intention. 

Not really. Statistical robustness is objectively defined, and the user's 
intention doesn't come into it. The mean is not a robust measure of 
central tendency; the median is, regardless of why you pick one or the 
other.
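
To make that concrete, here is a small sketch using only the standard
library. The numbers are made up for illustration, and m = 3 is my own
choice rather than anything the OP specified. I use pstdev because
numpy's std() defaults to the population standard deviation.

```python
from statistics import mean, median, pstdev

# Made-up sample: seven ordinary readings plus one wild value.
clean = [10, 11, 9, 10, 12, 10, 11]
dirty = clean + [1000]

# One bad value drags the mean a long way; the median barely moves.
print("mean:  ", mean(clean), "->", mean(dirty))
print("median:", median(clean), "->", median(dirty))

# The filter from the original post, written without numpy.
# m = 3 is an assumed cutoff, chosen only for this illustration.
m = 3
centre = mean(dirty)
spread = pstdev(dirty)
kept = [x for x in dirty if abs(x - centre) < m * spread]

# The wild value inflates both the mean and the standard deviation,
# so it still sits inside the 3-sigma band and survives the filter.
print("kept:", kept)
```

With eight values, no single point can ever be more than sqrt(7), or
about 2.65, population standard deviations from the mean, so a 3-sigma
cutoff cannot reject anything here no matter how extreme the outlier is.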

There are sometimes good reasons for choosing non-robust statistics or 
techniques over robust ones, but some techniques are so dodgy that there 
is *never* a good reason for using them. E.g. finding the line of best fit 
by eye, or taking more and more samples until you get a statistically 
significant result. Such techniques are not just non-robust in the 
statistical sense, but non-robust in the general sense, if not outright 
deceitful.
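
For anyone who doubts the second example, here is a rough simulation
sketch. Everything in it is an assumption made for illustration: the
data are pure noise (normal, mean 0, known standard deviation 1), the
test is a two-sided z-test at the 5% level, and the "peeking"
experimenter checks the result after every 10 samples up to 200.

```python
import math
import random

def p_value(total, n):
    """Two-sided p-value for H0: mean == 0, with known std deviation 1."""
    z = abs(total) / math.sqrt(n)      # z = sample mean * sqrt(n)
    return math.erfc(z / math.sqrt(2))

def run_trial(rng, max_n=200, step=10, alpha=0.05):
    """Simulate one experiment where the null hypothesis is true.

    Returns (peeking_rejects, fixed_rejects): whether any interim look
    was "significant", and whether the single test at max_n was.
    """
    total = 0.0
    peeking_rejects = False
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if n % step == 0 and p_value(total, n) < alpha:
            peeking_rejects = True     # would have stopped and reported here
    fixed_rejects = p_value(total, max_n) < alpha
    return peeking_rejects, fixed_rejects

rng = random.Random(2013)              # fixed seed so runs are repeatable
trials = 2000
results = [run_trial(rng) for _ in range(trials)]
peeking_rate = sum(p for p, _ in results) / trials
fixed_rate = sum(f for _, f in results) / trials

print(f"false positives, one test at n=200: {fixed_rate:.3f}")
print(f"false positives, stop when p<0.05:  {peeking_rate:.3f}")
```

The single test at a fixed sample size should come out near the nominal
5%, while the stop-when-significant strategy "finds" an effect in pure
noise far more often than that.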



-- 
Steven


