[Offtopic] Line fitting [was Re: Numpy outlier removal]

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jan 7 20:23:31 EST 2013


On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote:

> An example: Earlier today I was looking at some experimental data. A
> simple model of the process underlying the experiment suggests that two
> variables x and y will vary in direct proportion to one another and the
> data broadly reflects this. However, at this stage there is some
> non-normal variability in the data, caused by experimental difficulties.
> A subset of the data appears to closely follow a well defined linear
> pattern but there are outliers and the pattern breaks down in an
> asymmetric way at larger x and y values. At some later time either the
> sources of experimental variation will be reduced, or they will be
> better understood but for now it is still useful to estimate the
> constant of proportionality in order to check whether it seems
> consistent with the observed values of z. With this particular dataset I
> would have wasted a lot of time if I had tried to find a computational
> method to match the line that to me was very visible so I chose the line
> visually.


If you mean:

"I looked at the data, identified that the range a < x < b looks linear 
and the range x > b does not, then used least squares (or some other 
recognised, objective technique for fitting a line) to the data in that 
linear range"

then I'm completely cool with that. That's fine, with the understanding 
that this is the first step in either fixing your measurement problems, 
fixing your model, or at least avoiding extrapolation into the non-linear 
range.

But that is not fitting a line by eye, which is what I am talking about.
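
To make the first case concrete, here is a minimal sketch of what I have in mind: pick the range by inspection, then let ordinary least squares choose the line within it. The function name fit_line_in_range, the bounds a and b, and the data are all made up for illustration; this is plain Python rather than anything from numpy, and it is only one of several recognised fitting techniques.

```python
def fit_line_in_range(xs, ys, a, b):
    """Ordinary least squares fit of y = slope*x + intercept,
    using only the points with a < x < b (hypothetical helper)."""
    pts = [(x, y) for x, y in zip(xs, ys) if a < x < b]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points in the chosen range")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    # Sum of squared deviations in x, and the x-y cross term.
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("all x values in the range are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: roughly y = 2x up to x = 5, flattening out beyond that.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.0, 8.1, 9.9, 10.5, 10.8, 11.0]

# The range 0 < x < 5.5 is chosen by looking at the data;
# the line itself is chosen by the algorithm, not by eye.
slope, intercept = fit_line_in_range(xs, ys, a=0, b=5.5)
print(slope, intercept)
```

The judgement call is confined to the choice of a and b, which can be written down and argued about; the slope and intercept then follow mechanically from the data.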

If on the other hand you mean:

"I looked at the data, identified that the range a < x < b looked linear, 
so I laid a ruler down over the graph and pushed it around until I was 
satisfied that the ruler looked more or less like it fitted the data 
points, according to my guess of what counts as a close fit"

that *is* fitting a line by eye, and it is entirely subjective and 
extremely dodgy for anything beyond quick-and-dirty, back-of-the-envelope 
calculations[1]. That's okay if all you want is to get something within 
an order of magnitude or so, or a line roughly pointing in the right 
direction, but that's all.


[...]
> I also think it would
> be highly foolish to go so far with refusing to eyeball data that you
> would accept the output of some regression algorithm even when it
> clearly looks wrong.

I never said anything of the sort.

I said, don't fit lines to data by eye. I didn't say not to sanity check 
that your straight line fit is reasonable by eyeballing it.
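
As a rough sketch of the kind of sanity check I mean: take a line that was fitted by some objective method and look at how far each point sits from it. The slope and intercept below are assumed to come from an earlier least squares fit over the low-x part of some made-up data, and the residuals helper is hypothetical.

```python
def residuals(xs, ys, slope, intercept):
    """Vertical distance of each point from the line y = slope*x + intercept."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Made-up data: roughly linear up to x = 5, flattening out beyond that.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.0, 8.1, 9.9, 10.5, 10.8, 11.0]

# Assumed result of an earlier least squares fit; not chosen by eye.
slope, intercept = 1.98, 0.06

res = residuals(xs, ys, slope, intercept)
for x, r in zip(xs, res):
    print(f"x={x:>3}  residual={r:+.2f}")
```

Small residuals that scatter on both sides of zero suggest the line is reasonable over that range; a run of large residuals all with the same sign, as for the larger x values here, is the sort of thing eyeballing is good at catching.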



[1] Or if your data is so accurate and noise-free that you hardly have to 
care about errors, since there clearly is one and only one straight line 
that passes through all the points.


-- 
Steven


