[Offtopic] Line fitting [was Re: Numpy outlier removal]

Alan Spence alan.spence at ntlworld.com
Fri Jan 11 09:30:30 EST 2013


On 09 Jan 2013, at 00:02:11 Steven D'Aprano <steve at pearwood.info> wrote:

> The point I keep making, that everybody seems to be ignoring, is that 
> eyeballing a line of best fit is subjective, unreliable and impossible to 
> verify. How could I check that the line you say is the "best fit" 
> actually *is* the *best fit* for the given data, given that you picked 
> that line by eye? Chances are good that if you came back to the data a 
> month later, you'd pick a different line!

It might bring more insight to the debate to distinguish parameter error from model error.  Steven is correct if you consider only parameter error.  However, model error is often the main problem, and here visual techniques might well improve your model selection, even if what you end up with is not a real model but a visually based approximation to one.  If you only do it by eye, though, you end up somewhere that is not rigorous from a modelling perspective, and other issues can arise from that.  Visual techniques might also help in dealing with outliers, but again in an unrigorous manner.  They can also bring in real-world knowledge (though this is really the same point as model selection).
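To make the distinction concrete, here is a minimal sketch, assuming NumPy and synthetic data that I have made up for illustration (the helper name fit_poly is hypothetical).  The straight-line fit reports standard errors for its slope and intercept, which is the parameter error, but those numbers say nothing about whether a straight line was the right model in the first place.  Comparing the residuals against a quadratic fit is one crude way to see the model error.

```python
import numpy as np

def fit_poly(x, y, degree):
    """Least-squares polynomial fit.

    Returns the coefficients, their estimated standard errors
    (parameter error), and the RMS of the residuals.
    """
    coeffs, cov = np.polyfit(x, y, degree, cov=True)
    std_errs = np.sqrt(np.diag(cov))
    residuals = y - np.polyval(coeffs, x)
    rms = float(np.sqrt(np.mean(residuals ** 2)))
    return coeffs, std_errs, rms

# Synthetic data (an assumption for illustration): truly quadratic plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 0.5 * x**2 - x + 2.0 + rng.normal(scale=1.0, size=x.size)

line_coeffs, line_errs, line_rms = fit_poly(x, y, 1)
quad_coeffs, quad_errs, quad_rms = fit_poly(x, y, 2)

print("line:      coeffs", line_coeffs, "std errs", line_errs, "rms", line_rms)
print("quadratic: coeffs", quad_coeffs, "std errs", quad_errs, "rms", quad_rms)
```

A plot of the residuals from the straight-line fit would show the curvature at a glance, which is the sort of thing the eye is good at; the fit itself still has to be done numerically.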

With regard to the original post on outliers, Steven made a lot of excellent points.  However, there are at least two important issues which he didn't mention.  (1) You must think long and hard about the outliers.  For example, can they recur, or have actions been taken in the real world that mean they can't happen again?  Are they actually data errors?  The answers to questions like these may change how you deal with them.  (2) It is best to fit your model both with and without the outliers and see what impact that has on the real-world application you're doing the analysis for.  It's also good to try more than one set of excluded outliers, to see how stable the results are as the number of points you remove changes.  If the results change much, be very careful how you use them.
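Point (2) can be sketched along these lines, again assuming NumPy and made-up data.  The helper fit_excluding_worst is hypothetical, and ranking candidate outliers by the size of their residual from an initial fit to all the data is only one simple choice among many; in practice the considerations in point (1) should decide which points are candidates for exclusion.

```python
import numpy as np

def fit_excluding_worst(x, y, n_drop):
    """Fit a straight line after dropping the n_drop points with the
    largest absolute residuals from an initial fit to all the data.

    Returns (slope, intercept) of the refitted line.
    """
    slope, intercept = np.polyfit(x, y, 1)
    if n_drop > 0:
        abs_resid = np.abs(y - (slope * x + intercept))
        # Indices of the points to keep: those with the smallest residuals.
        keep = np.argsort(abs_resid)[: len(x) - n_drop]
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
    return float(slope), float(intercept)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise,
# with two large outliers injected near the right-hand end.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[27] += 25.0
y[28] += 30.0

# Refit with 0, 1, 2, 3 and 4 points excluded and compare the results.
results = {n: fit_excluding_worst(x, y, n) for n in range(5)}
for n, (slope, intercept) in results.items():
    print(f"dropped {n}: slope = {slope:.3f}, intercept = {intercept:.3f}")
```

If the slope and intercept settle down after the first couple of exclusions, that is some reassurance; if they keep moving as more points are removed, that is the warning sign described above.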

Alan