[SciPy-User] Unit testing of Bayesian estimator

Anne Archibald peridot.faceted at gmail.com
Sun Nov 8 22:22:36 EST 2009


2009/11/8  <josef.pktd at gmail.com>:
> On Sun, Nov 8, 2009 at 5:51 PM, Anne Archibald
> <peridot.faceted at gmail.com> wrote:
>> 2009/11/8  <josef.pktd at gmail.com>:
>>
>>> When I do a Monte Carlo for point estimates, I usually check bias,
>>> variance, mean squared error,
>>> and mean absolute and median absolute error (the latter are more
>>> robust to outliers, e.g. because in some cases the estimator produces
>>> numerical nonsense due to non-convergence or other numerical
>>> problems). MSE better captures cases of biased estimators that
>>> do well in the MSE sense.
>>
>> I can certainly compute all these quantities from a collection of
>> Monte Carlo runs, but I don't have any idea what values would indicate
>> correctness, apart from "not too big".
>
> I consider them mainly as an absolute standard to see how well
> an estimator works (or what the size and power of a test is), or
> to compare it to other estimators, which is common practice in
> published Monte Carlo studies of new estimators.

Ah. Yes, I will definitely be wanting to compute these at some point.
But first I'd like to make sure this estimator is doing what I want it
to.
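
When I do get around to it, I expect the bookkeeping for those summary
statistics to look something like this (just a sketch; estimates and
true_value stand in for whatever my test harness actually produces):

import numpy as np

def point_estimate_summary(estimates, true_value):
    # Summarize a pile of Monte Carlo point estimates against the
    # known true parameter value.
    estimates = np.asarray(estimates)
    err = estimates - true_value
    return {"bias": err.mean(),
            "variance": estimates.var(),
            "mse": (err**2).mean(),
            "mean_abs_err": np.abs(err).mean(),
            # median absolute error is robust to the occasional
            # nonsense value from a non-converged run
            "median_abs_err": np.median(np.abs(err))}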

>> I could test more quantiles, but I'm very distrustful of testing more
>> than one quantile per randomly-generated sample: they should be
>> covariant (if the 90% mark is too high, the 95% mark will almost
>> certainly be too high as well) and I don't know how to take that into
>> account. And running the test is currently so slow I'm inclined to
>> spend my CPU time on a stricter test of a single quantile. Though
>> unfortunately to increase the strictness I also need to improve the
>> sampling in phase and fraction.
>
> Adding additional quantiles might be relatively cheap, mainly the
> call to searchsorted. One or two quantiles could be consistent
> with many different distributions, e.g. ones with fatter tails, so I
> usually check more points.

As I said, I'm concerned about using more than one credible interval
per simulation run, since the credible intervals for different
quantiles will be different sizes.
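
For concreteness, the single-quantile check I have in mind is
essentially the following (schematic only; run_simulation stands in
for my actual generate-and-infer code, and is assumed to return the
posterior CDF evaluated at the true parameter for one synthetic data
set):

from scipy import stats

def check_quantile_coverage(run_simulation, quantile=0.95, n_runs=1000):
    # If the posterior is calibrated, the posterior CDF evaluated at
    # the true parameter should fall below quantile in a fraction
    # quantile of the runs.
    hits = sum(run_simulation() <= quantile for _ in range(n_runs))
    # two-sided binomial test of the observed coverage against the
    # nominal level
    p_value = stats.binom_test(hits, n_runs, quantile)
    return hits / float(n_runs), p_value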

>>> In each iteration of the Monte Carlo you get a full posterior distribution;
>>> after a large number of iterations you have a sampling distribution,
>>> and it should be possible to compare this distribution with the
>>> posterior distributions. I'm still not sure how.
>>
>> I don't understand what you mean here. I do get a full posterior
>> distribution out of every simulation. But how would I combine these
>> different distributions, and what would the combined distribution
>> mean?
>
> I'm still trying to think how this can be done. Checking more quantiles
> as discussed above might be doing it to some extent.
> (I also wonder whether it might be useful to fix the observations
> during the Monte Carlo and only vary the sampling of the parameters?)

I can see that fixing the data would be sort of nice, but it's not at
all clear to me what it would even mean to vary the model while
keeping the data constant - after all, the estimator has no access to
the model, only the data, so varying the model would have no effect on
the result returned.

>> Indeed. But with frequentist tests, I have a clear statement of what
>> they're telling me that I can test against: "If you feed this test
>> pure noise you'll get a result this high with probability p". I
>> haven't figured out how to turn the p-value returned by this test into
>> something I can test against.
>
> What exactly are the null and the alternative hypothesis that you
> want to test? This is still not clear to me, see also below.

Null hypothesis: no pulsations, all photons are drawn from a uniform
distribution.

Alternative: photons are drawn from a distribution with pulsed
fraction f and phase p.

>> This is what the code currently implements: I begin with a 50% chance
>> the signal is unpulsed and a 50% chance the signal is pulsed with some
>> fraction >= 0.
>
> I don't see this
>
> in generate you have m = np.random.binomial(n, fraction)
> where m is the number of pulsed observations.

Generate can be used to generate photons from a uniform distribution
by calling it with fraction set to zero. I don't actually do this
while testing the credible intervals, because (as I understand it)
the presence of this hypothesis does not affect the credible
intervals. That is, the credible intervals I'm testing are the
credible intervals assuming that the pulsations are real. I'm not at
all sure how to incorporate the alternative hypothesis into my
testing.
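
To spell out what I mean by generate (a simplified sketch of the idea,
not my actual code; the wrapped-Gaussian pulse shape here is just a
stand-in):

import numpy as np

def generate(fraction, phase, n):
    # n photon phases in [0, 1): a binomial number of them are pulsed
    # and drawn from a peaked distribution centred on phase; the rest
    # are uniform.  Calling this with fraction=0 gives pure uniform
    # (unpulsed) photons.
    m = np.random.binomial(n, fraction)
    pulsed = (phase + 0.1*np.random.randn(m)) % 1.0  # stand-in pulse shape
    unpulsed = np.random.rand(n - m)
    return np.concatenate((pulsed, unpulsed))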

> the probability of observing no pulsed observations is very
> small
> >>> stats.binom.pmf(0,100,0.05)
> 0.0059205292203340009
>
> your likelihood function pdf_data_given_model
> also gives every observation the same probability (the fraction)
> of being pulsed.

pdf_data_given_model computes the PDF given a set of model parameters.
If you want to use it to get a likelihood with fraction=0, you can
call it with fraction=0. But this likelihood is always zero.

> I would have expected something like
> case1: pulsed according to binomial fraction, same as now
> case2: no observations are pulsed
> prior Prob(case1)=0.5, Prob(case2)=0.5
>
> Or am I missing something?
>
> Where is there a test? You have a good posterior distribution
> for the fraction, which implies a point estimate and confidence interval,
> and these look good from the tests.
> But I don't see a test hypothesis (especially not a Bayesian statement).

When I came to implement it, the only place I actually needed to
mention the null hypothesis was in the calculation of the pulsed
probability, which is the last value returned by the inference
routine. I did make the somewhat peculiar choice to return the model
PDF normalized so that its total probability is one, rather than
scaling all points by the probability that the system is pulsed at
all.

It turns out that the inference code just computes the total
normalization S over all parameters; then the probability that the
signal is pulsed is S/(S+1).
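
Schematically (not the actual inference code, just the idea;
likelihood_ratio here is assumed to be the data likelihood at a given
fraction and phase divided by the likelihood under the uniform model):

def probability_pulsed(likelihood_ratio, fractions, phases):
    # Average the likelihood ratio over a flat prior on the
    # pulsed-model parameters (approximated by a grid) to get the
    # marginal likelihood ratio S; with 50/50 prior odds the posterior
    # probability of pulsation is S/(S+1).
    values = [likelihood_ratio(f, p) for f in fractions for p in phases]
    S = sum(values) / float(len(values))
    return S / (S + 1.0)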

Anne


