[Offtopic] Line fitting [was Re: Numpy outlier removal]

Mon Jan 7 17:32:54 EST 2013

On 7 January 2013 17:58, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> On Mon, 07 Jan 2013 15:20:57 +0000, Oscar Benjamin wrote:
>
>> There are sometimes good reasons to get a line of best fit by eye. In
>> particular if your data contains clusters that are hard to separate,
>> sometimes it's useful to just pick out roughly where you think a line
>> through a subset of the data is.
>
> Cherry picking subsets of your data as well as line fitting by eye? Two
> wrongs do not make a right.

It depends on what you're doing, though. I wouldn't use an eyeball fit
to get numbers that were an important part of the conclusion of a
study. I very often use one while I'm just in the process of trying to
understand something.

> If you're going to just invent a line based on where you think it should
> be, what do you need the data for? Just declare "this is the line I wish
> to believe in" and save yourself the time and energy of collecting the
> data in the first place. Your conclusion will be no less valid.

An example: earlier today I was looking at some experimental data. A
simple model of the process underlying the experiment suggests that
two variables x and y will vary in direct proportion to one another,
and the data broadly reflects this. However, at this stage there is
some non-normal variability in the data, caused by experimental
difficulties. A subset of the data appears to closely follow a
well-defined linear pattern, but there are outliers and the pattern
breaks down in an asymmetric way at larger x and y values. At some
later time either the sources of experimental variation will be
reduced or they will be better understood, but for now it is still
useful to estimate the constant of proportionality in order to check
whether it seems consistent with the observed values of z. With this
particular dataset I would have wasted a lot of time if I had tried to
find a computational method to match the line that to me was very
visible, so I chose the line visually.

>
> How do you distinguish between "data contains clusters that are hard to
> separate" from "data doesn't fit a line at all"?
>

In the example I gave it isn't possible to make that distinction with
the currently available data. That doesn't make it meaningless to try
to estimate the parameters of the relationship between the variables
using the preliminary data.

> Even if the data actually is linear, on what basis could we distinguish
> between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit
> by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely
> subjective judgement can be equally denied on the basis of subjective
> judgement.

It gets a bit easier if the line is constrained to go through the
origin. You seem to be thinking that the important thing is proving
that the line is "real", rather than identifying where it is. Both
things are important but not necessarily in the same problem. In my
example, the "real line" may not be straight and may not go through
the origin, but it is definitely there and if there were no
experimental problems then the data would all be very close to it.
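
To illustrate why the through-origin constraint makes things easier:
the model y = k*x has a single free parameter, and its least-squares
estimate has the closed form k = sum(x*y) / sum(x*x). The sketch below
is only an illustration, not the analysis described in this message;
the function name and the data values are made up.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope k for the model y = k*x (no intercept term)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx

# Made-up data lying exactly on y = 2*x (illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
k = slope_through_origin(xs, ys)
print(k)
```

With only one number to choose, two people fitting by eye have much
less room to disagree than when both slope and intercept are free.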

> Anyone can fool themselves into placing a line through a subset of non-
> linear data. Or, sadly more often, *deliberately* cherry picking fake
> clusters in order to fool others. Here is a real world example of what
> happens when people pick out the data clusters that they like based on
> visual inspection:
>
> http://www.skepticalscience.com/images/TempEscalator.gif
>
> And not linear by any means, but related to the cherry picking theme:
>
> http://www.skepticalscience.com/pics/1_ArcticEscalator2012.gif
>
>
> To put it another way, when we fit patterns to data by eye, we can easily
> fool ourselves into seeing patterns that aren't there, or missing the
> patterns which are there. At best line fitting by eye is prone to honest
> errors; at worst, it is open to the most deliberate abuse. We have eyes
> and brains that evolved to spot the ripe fruit in trees, not to spot
> linear trends in noisy data, and fitting by eye is not safe or
> appropriate.

This is all true. But the human brain is also in many ways much better
than a typical computer program at recognising patterns in data when
the data can be depicted visually. I would very rarely attempt to
analyse data without representing it in some visual form. I also think
it would be highly foolish to take the refusal to eyeball data so far
that you would accept the output of some regression algorithm even
when it clearly looks wrong.
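
As a rough sketch of how a regression result can "clearly look wrong":
with made-up data, a single outlier at large x pulls the least-squares
slope far from the line that most of the points follow. The median of
the ratios y/x is used here only as a simple comparison that is less
sensitive to outliers; it is an assumption of this sketch, not the
method used in the analysis described above, and the function names
are hypothetical.

```python
from statistics import median

def slope_through_origin(xs, ys):
    """Least-squares slope k for the model y = k*x (no intercept term)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def median_ratio(xs, ys):
    """Median of y/x over points with nonzero x; less outlier-sensitive."""
    return median(y / x for x, y in zip(xs, ys) if x != 0)

# Made-up data: four points on y = 2*x plus one outlier at the largest x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 40.0]

ls_slope = slope_through_origin(xs, ys)  # dragged upwards by the outlier
mr_slope = median_ratio(xs, ys)          # stays with the bulk of the points
print(ls_slope, mr_slope)
```

Plotting the points together with both lines would make the
discrepancy obvious at a glance, which is the kind of visual check
being argued for here.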

Oscar