[Offtopic] Line fitting [was Re: Numpy outlier removal]

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jan 7 12:58:42 EST 2013


On Mon, 07 Jan 2013 15:20:57 +0000, Oscar Benjamin wrote:

> There are sometimes good reasons to get a line of best fit by eye. In
> particular if your data contains clusters that are hard to separate,
> sometimes it's useful to just pick out roughly where you think a line
> through a subset of the data is.

Cherry picking subsets of your data as well as line fitting by eye? Two 
wrongs do not make a right.

If you're going to just invent a line based on where you think it should 
be, what do you need the data for? Just declare "this is the line I wish 
to believe in" and save yourself the time and energy of collecting the 
data in the first place. Your conclusion will be no less valid.

How do you distinguish "data contains clusters that are hard to 
separate" from "data doesn't fit a line at all"?

Even if the data actually is linear, on what basis could we distinguish 
between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit 
by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely 
subjective judgement can be equally denied on the basis of subjective 
judgement.
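
For what it's worth, here is a minimal sketch of how an objective criterion 
settles that argument. The data is synthetic and entirely made up for 
illustration (the "true" slope 2.8, intercept 3.9, noise level and seed are 
my assumptions, not anything from this thread), and the helper names are 
hypothetical. It scores both eyeballed lines by their sum of squared 
residuals and compares them against the ordinary least-squares line:

```python
import random


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic, made-up data: a noisy line with an assumed slope and intercept.
rng = random.Random(0)
xs = [i * 0.5 for i in range(40)]
ys = [2.8 * x + 3.9 + rng.gauss(0, 2.0) for x in xs]

fit_slope, fit_intercept = least_squares_line(xs, ys)

# Score the two lines "fitted by eye" and the least-squares line.
ssr_you = sum_squared_residuals(xs, ys, 2.5, 3.7)
ssr_me = sum_squared_residuals(xs, ys, 3.1, 4.1)
ssr_fit = sum_squared_residuals(xs, ys, fit_slope, fit_intercept)

print("your line:     y = 2.5x + 3.7, SSR = %.2f" % ssr_you)
print("my line:       y = 3.1x + 4.1, SSR = %.2f" % ssr_me)
print("least squares: y = %.2fx + %.2f, SSR = %.2f"
      % (fit_slope, fit_intercept, ssr_fit))
```

By construction, no straight line can have a smaller sum of squared 
residuals than the least-squares one, so neither eyeballed line can beat it 
on this criterion. Whether squared error is the right criterion for a given 
data set is a separate question, but at least it is one that can be stated 
and checked.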

Anyone can fool themselves into placing a line through a subset of 
non-linear data. Or, sadly more often, people *deliberately* cherry pick 
fake clusters in order to fool others. Here is a real world example of 
what happens when people pick out the data clusters that they like based 
on visual inspection:

http://www.skepticalscience.com/images/TempEscalator.gif

And not linear by any means, but related to the cherry picking theme:

http://www.skepticalscience.com/pics/1_ArcticEscalator2012.gif
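
The same trick is easy to reproduce. The sketch below uses a made-up, 
deliberately noiseless sawtooth series (not real temperature data; the 
coefficients and the window length are my assumptions) that rises overall 
but declines inside every ten-step window. Fitting a least-squares line to 
each hand-picked window gives a falling trend every time, while the fit to 
all of the data is rising:

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


WINDOW = 10  # length of each cherry-picked subset (an assumed value)

# Made-up series: a steady rise of 0.05 per step, minus a sawtooth that
# falls by 0.08 per step and resets at the start of every window.
ts = list(range(40))
ys = [0.05 * t - 0.08 * (t % WINDOW) for t in ts]

# Trend using all of the data.
overall_slope = least_squares_slope(ts, ys)

# Trend inside each cherry-picked window.
window_slopes = [
    least_squares_slope(ts[start:start + WINDOW], ys[start:start + WINDOW])
    for start in range(0, len(ts), WINDOW)
]

print("slope over all data: %+.4f" % overall_slope)
for start, slope in zip(range(0, len(ts), WINDOW), window_slopes):
    print("slope over t = %2d..%2d: %+.4f"
          % (start, start + WINDOW - 1, slope))
```

Real data is noisy, so the effect is rarely this clean, but the mechanism is 
the same: choose where the subsets begin and end and you choose the trend.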


To put it another way, when we fit patterns to data by eye, we can easily 
fool ourselves into seeing patterns that aren't there, or missing the 
patterns which are there. At best, line fitting by eye is prone to honest 
errors; at worst, it is open to the most deliberate abuse. We have eyes 
and brains that evolved to spot the ripe fruit in trees, not to spot 
linear trends in noisy data, and fitting by eye is not safe or 
appropriate.


-- 
Steven


