[Offtopic] Line fitting [was Re: Numpy outlier removal]

Mon Jan 7 21:06:35 EST 2013

On Tue, 08 Jan 2013 06:43:46 +1100, Chris Angelico wrote:

> On Tue, Jan 8, 2013 at 4:58 AM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> Anyone can fool themselves into placing a line through a subset of non-
>> linear data. Or, sadly more often, *deliberately* cherry picking fake
>> clusters in order to fool others. Here is a real world example of what
>> happens when people pick out the data clusters that they like based on
>> visual inspection:
>>
>> http://www.skepticalscience.com/images/TempEscalator.gif
> 
> And sensible people will notice that, even drawn like that, it's only a
> ~0.6 deg increase across ~30 years. Hardly statistically significant,

Well, I don't know about "sensible people", but the magnitude of an effect 
has little to do with whether or not it is statistically significant. 
Given noisy data, statistical significance relates to whether or not we 
can be confident that the effect is *real*, not whether it is a big 
effect or a small effect.

Here's an example: assume that you are on a fixed salary with a constant 
weekly income. If you happen to win the lottery one day, and consequently 
your income for that week quadruples, that is a large effect that fails 
to have any statistical significance -- it's a blip, not part of any long-
term change in income. You can't conclude that you'll win the lottery 
every week from now on.

On the other hand, if the government changes the rules relating to tax, 
deductions, etc., even by a small amount, your weekly income might go 
down, or up, by a single dollar. Even though that is a tiny effect, it is 
*not* a blip, and will be statistically significant. In practice, it 
takes a certain number of data points to reach that confidence level. 
Your accountant, who knows the tax laws, will conclude that the change is 
real immediately, but a statistician who sees only the pay slips may take 
some months before she is convinced that the change is signal rather than 
noise. With only three weeks' pay slips in hand, the statistician cannot 
be sure that the difference is not just some accounting error or other 
fluke, but each additional data point increases the confidence that the 
difference is real and not just some temporary aberration.
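
To make that concrete, here is a rough sketch of the statistician's 
position. Everything in it is an assumption for illustration: the income 
of 1000, the one dollar drop, the noise of about two dollars per pay slip, 
and the helper names. It uses a normal approximation for the test, which 
is crude for very small samples, so treat the numbers as indicative only.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def p_value_for_shift(before, after):
    """Two-sided p-value for a difference in means (normal approximation)."""
    standard_error = math.sqrt(
        stdev(before) ** 2 / len(before) + stdev(after) ** 2 / len(after)
    )
    z = (mean(after) - mean(before)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_payslips(weeks, income, noise_sd, rng):
    """Hypothetical pay slips: a fixed income plus small random noise."""
    return [rng.gauss(income, noise_sd) for _ in range(weeks)]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that runs are repeatable
    for weeks in (3, 12, 52, 520):
        before = simulate_payslips(weeks, 1000.0, 2.0, rng)
        after = simulate_payslips(weeks, 999.0, 2.0, rng)  # one dollar less
        p = p_value_for_shift(before, after)
        print(f"{weeks:4d} weeks of slips: p = {p:.4f}")
```

With only a handful of slips the p-value will usually be large, and it 
tends to shrink as more weeks are added, even though the size of the 
effect never changes from one dollar.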

The other meaning of "significant" has nothing to do with statistics, and 
everything to do with "a difference is only a difference if it makes a 
difference". 0.2° per decade doesn't sound like much, not when we 
consider daily or yearly temperatures that typically have a range of tens 
of degrees between night and day, or winter and summer. But that 
misunderstands the nature of long-term climate versus daily weather. It 
also glosses over the fact that we're only talking about an average, and 
ignores changes to the variability of the climate: a small increase in 
average can lead to a large increase in extreme events.
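
A back-of-the-envelope sketch of that last point, under loudly stated 
assumptions: values are normally distributed, the spread stays the same, 
and an "extreme event" means exceeding three standard deviations above 
the old average. The half standard deviation shift is made up for the 
example and is not taken from any climate data set.

```python
from statistics import NormalDist


def exceedance_probability(average, spread, threshold):
    """Chance that a normally distributed value exceeds the threshold."""
    return 1 - NormalDist(average, spread).cdf(threshold)


# Assumed, illustrative numbers in units of the original standard deviation.
threshold = 3.0
baseline = exceedance_probability(0.0, 1.0, threshold)
shifted = exceedance_probability(0.5, 1.0, threshold)  # average moves by 0.5

print(f"baseline chance of an extreme: {baseline:.5f}")
print(f"shifted chance of an extreme:  {shifted:.5f}")
print(f"ratio: {shifted / baseline:.1f}")
```

Under these assumptions the shifted average makes the extreme several 
times more likely, although the average itself moved by only a fraction 
of the usual spread. Any change in variability would alter the picture 
further, and the sketch leaves that out entirely.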

> given that weather patterns have been known to follow cycles at least
> that long.

That is not a given. "Weather patterns" don't last for thirty years. 
Perhaps you are talking about climate patterns? In which case, well, yes, 
we can see a very strong climate pattern of warming on a time scale of 
decades, with no evidence that it is a cycle.

There are, of course, many climate cycles that take place on a time frame 
of years or decades, such as the North Atlantic Oscillation and the El 
Niño Southern Oscillation. None of them are global, and as far as I know 
none of them are exactly periodic. They are noise in the system, and 
certainly not responsible for linear trends.
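
As a toy illustration only, here is a sketch that fits a least-squares 
line to a purely synthetic series. The five year cycle, its amplitude, 
the 30 year window and the trend of 0.02 per year are all invented for 
the example, and a clean sine wave is far tidier than any real climate 
oscillation.

```python
import math


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical monthly series over 30 years (every number here is made up).
years = [month / 12 for month in range(360)]
trend_per_year = 0.02  # i.e. 0.2 per decade
cycle = [0.2 * math.sin(2 * math.pi * t / 5) for t in years]
cycle_plus_trend = [c + trend_per_year * t for c, t in zip(cycle, years)]

slope_cycle = least_squares_slope(years, cycle)
slope_both = least_squares_slope(years, cycle_plus_trend)
print(f"cycle only:       {slope_cycle:+.4f} per year")
print(f"cycle plus trend: {slope_both:+.4f} per year")
```

Over a window that covers several whole cycles, the fitted slope of the 
oscillation alone is small compared with the assumed trend, while the 
series with the trend added fits to a slope close to that trend. A short 
window starting at a peak or a trough would give a very different answer, 
which is exactly how cherry-picked sub-ranges mislead.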

-- 
Steven