[Numpy-discussion] recommendations on goodness of fit functions?

Tue Apr 14 23:02:08 EDT 2009

On Tue, Apr 14, 2009 at 10:13 AM, Bruce Southey <bsouthey at gmail.com> wrote:
> Brennan Williams wrote:
>> Charles R Harris wrote:
>>>
>>>
>>> On Mon, Apr 13, 2009 at 6:28 PM, Brennan Williams
>>> <brennan.williams at visualreservoir.com
>>> <mailto:brennan.williams at visualreservoir.com>> wrote:
>>>
>>>     Hi numpy/scipy users,
>>>
>>>     I'm looking to add some basic goodness-of-fit functions/plots to
>>>     my app.
>>>
>>>     I have a set of simulated y vs time data and a set of observed y
>>>     vs time
>>>     data.
>>>
>>>     The time values aren't always the same, i.e. there are often fewer
>>>     observed data points.
>>>
>>>     Some variables will be in a 0.0...1.0 range, others in a
>>>     0.0.....1.0e+12
>>>     range.
> Ignoring the time scale, if time is linear with respect to the
> parameters then you should be okay. If these cause numerical issues, you
> probably want to scale or standardize the time scale first. If your
> model has polynomials then you probably want to select an orthogonal
> type that has good properties for what you want. Otherwise, you may
> either transform the data or fit an appropriate non-linear model.
>
>>>
>>>     I'm also hoping to update the calculated goodness of fit value at
>>>     each simulated timestep, the idea being to allow the user to set a
>>>     tolerance level which if exceeded stops the simulation (which
>>>     otherwise can keep running for many hours/days).
> Exactly how does this work?
>
> I would have thought that you have a simulated model that has set
> parameters and simulates the data based on the inputs that include a
> time range. For example, if I have a simple linear model Y=Xb, I would
> estimate parameters b and then allow the user to provide their X so it
> should stop relatively quickly depending on the model complexity and
> dimensions of X.
>
> It appears like you are doing some search over some values to maximize
> the parameters b or some type of 'what if' or sensitivity scenarios. In
> the first case you probably should use one of the optimization
> algorithms in scipy. In the second case, the simulation would stop when
> it exceeds the parameter space.
>
>>>
>>>
>> Before I try and answer the following, attached is an example of a
>> suggested GOF function.
>>
>>> Some questions.
>>>
>>> 1) What kind of fit are you doing?
>>> 2) What is the measurement model?
>>> 3) What do you know a priori about the measurement errors?
>>> 4) How is the time series modeled?
>>>
>> The simulated data are output by an oil reservoir simulator.
> So much depends on the simulator model - for simple linear models
> assuming normality you could get away with R2 or mean square error or
> mean absolute deviance. But that really doesn't cut it with complex
> models. If you are trying different models, then you should look at
> model comparison techniques.
>>
>> Time series is typically monthly or annual timesteps over anything
>> from 5-30 years
>> but it could also be in some cases 10 minute timesteps over 24 hours
>>
>> The timesteps output by the simulator are controlled by the user and
>> are not always even, e.g. for a simulation over 30 years you may
>> have annual timesteps from year 0 to year 25 and then 3 monthly from
>> year 26-29 and then monthly for the most recent year.
> If the data is not measured on the same regular interval across the
> complete period (say every month) you have a potential problem of
> selecting the correct time scale and making suitable assumptions for
> missing data points (like assuming that the value from one year is equal
> to the value at 1/4 of a year). If you can not make suitable
> assumptions, you probably can not mix the different time scales so you
> probably need a piece-wise solution to handle the different periods.
>
>>
>> Not sure about measurement errors - the older the data the higher the
>> errors due to changes in oil field measurement technology.
>> And the error range varies depending on the data type as well, e.g.
>> error range for a water meter is likely to be higher than that for an
>> oil or gas meter.
>>
> This implies heterogeneity of variance over time which further
> complicates things. But again this is something that should be addressed
> when creating the simulator.

Just some ideas (assuming you have the simulator as a black box):

* I would consider the difference between simulation results and data
as a univariate time series and try to estimate the simulation error
variance, either moving, depreciated or conditional, using some
approach from this literature.

* scikits.timeseries has some statistical window functions, with
either moving window or geometric depreciation,
  which might be useful either directly or as examples

* In your goodness of fit formula, I would use p=1, i.e. mean absolute
deviation, if you have some measurement or prediction outliers or fat
tails, and p=2, i.e. RMSE, if the distribution of simulation errors
looks more like a normal distribution.

* unequally spaced observation time points:
I would start with a high frequency model (the highest frequency for
that simulation e.g. monthly) with intertemporal correlation of the
simulation error, and then derive the correlation between two observed
time points for weighting the variance estimate. There is quite a
literature on irregularly observed time series, with which I'm not
very familiar. Often it starts with an assumption on an underlying
continuous time model with a selection process for the time periods at
which the process is observed.
In your case, when you don't have a large number of observations, I
would just start with a simple AR(1) and calculate the prediction
confidence interval as goodness of fit. (The problem is that this
assumes stationarity and it wouldn't get any depreciation in the
information. If there are large deviations from stationarity, then a
messier model with, for example, conditional or time varying variance
or mean would be necessary.)

* to pull the plug for simulations that go off the map: this could
also be based on some tests of stationarity, e.g. a simple test
whether there was a structural break in the mean 5 observations ago.
With only a few observations the power might not be very good and
other tests like unit root tests won't have enough power, but it would
kill bad simulations. (I usually get complex numbers for prices and so
on, when I screw up in the simulation.)

* if the physical measurement errors are small compared to the
simulation errors, I would just ignore them. If you have significant
changes over time, then extra information about this is required, the
s_i in your goodness of fit formula.
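
A minimal sketch of the p=1 / p=2 choice above, under stated
assumptions: the attached formula is not reproduced in this message, so
the form used here, (mean(|(sim_i - obs_i) / s_i|^p))^(1/p), is a guess
at it. The name goodness_of_fit and its arguments are hypothetical, and
linear interpolation of the simulated series onto the observed time
points is only one simple way to handle the mismatched times.

```python
import numpy as np

def goodness_of_fit(t_sim, y_sim, t_obs, y_obs, p=2, s=None):
    """Scaled p-norm misfit between simulated and observed series.

    p=1 gives a mean absolute deviation, p=2 gives an RMSE.
    s is an optional array of per-observation error scales (the s_i);
    if omitted, the residuals are used unscaled.
    """
    t_sim = np.asarray(t_sim, dtype=float)
    y_sim = np.asarray(y_sim, dtype=float)
    t_obs = np.asarray(t_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)

    # Assumption: linear interpolation of the simulation at the observed
    # times is acceptable; t_sim must be increasing for np.interp.
    y_at_obs = np.interp(t_obs, t_sim, y_sim)

    resid = y_at_obs - y_obs
    if s is not None:
        resid = resid / np.asarray(s, dtype=float)

    return np.mean(np.abs(resid) ** p) ** (1.0 / p)
```

For variables on very different scales (0..1 versus 0..1e12), passing s
on the scale of each variable keeps the resulting values comparable.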

It all depends on how sophisticated you want to get in this. In
finance, where investors bet large amounts of money on this, the
statistical models for this can be very complex, but maybe you can get
away just with the RMSE or prediction confidence interval.
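
As a rough illustration of the AR(1) prediction confidence interval
idea, here is a sketch that fits e_t = c + phi * e_{t-1} + u_t by
ordinary least squares to the series of simulation errors and flags a
new error that falls outside the one-step interval. It assumes equally
spaced errors, stationarity, and roughly normal innovations (hence the
default z=1.96); the function names are hypothetical, and the unequal
spacing discussed above is not handled.

```python
import numpy as np

def ar1_prediction_interval(e, z=1.96):
    """One-step-ahead prediction interval from a least-squares AR(1) fit."""
    e = np.asarray(e, dtype=float)
    x, y = e[:-1], e[1:]

    # Regress e_t on a constant and e_{t-1}.
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    c, phi = coef

    # Innovation standard deviation from the fit residuals
    # (two estimated parameters).
    u = y - A @ coef
    sigma = np.sqrt(np.sum(u ** 2) / max(len(y) - 2, 1))

    center = c + phi * e[-1]
    return center - z * sigma, center + z * sigma

def should_stop(e_history, e_new, z=1.96):
    """True if the newest simulation error leaves the prediction interval."""
    lo, hi = ar1_prediction_interval(e_history, z)
    return not (lo <= e_new <= hi)
```

With only a handful of observations the interval will be wide, so this
mostly catches simulations that go far off the map.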

Josef