[SciPy-User] Unit testing of Bayesian estimator

Anne Archibald peridot.faceted at gmail.com
Sun Nov 8 17:51:08 EST 2009


2009/11/8  <josef.pktd at gmail.com>:

> When I do a Monte Carlo for point estimates, I usually check bias,
> variance, mean squared error, and mean absolute and median absolute
> error (the latter are more robust to outliers, e.g. when the estimator
> produces numerical nonsense for some cases because of non-convergence
> or other numerical problems). MSE better captures cases of biased
> estimators that do better in the MSE sense.

I can certainly compute all these quantities from a collection of
Monte Carlo runs, but I don't have any idea what values would indicate
correctness, apart from "not too big".
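
Computing them is easy enough; a rough sketch (estimates and
true_value are placeholder names for the array of Monte Carlo point
estimates and the parameter used to generate the data):

    import numpy as np

    def mc_summary(estimates, true_value):
        # summarize Monte Carlo point estimates against the known truth
        est = np.asarray(estimates, dtype=float)
        err = est - true_value
        return {"bias": np.mean(err),
                "variance": np.var(est),
                "mse": np.mean(err ** 2),
                "mean_abs_error": np.mean(np.abs(err)),
                "median_abs_error": np.median(np.abs(err))}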

> I ran your test, test_bayes.py, for M = 50, 500 and 1000, adding
> "return in_interval_f" and inside = test_credible_interval().
>
> If my reading is correct, inside should be 80% of M, and you are pretty close.
> (M=1000 is pretty slow on my notebook.)

Yeah, that's the problem with using the world's simplest numerical
integration scheme.

> >>> inside
> 39
> >>> 39/50.
> 0.78000000000000003
>
> >>> inside
> 410
> >>> inside/500.
> 0.81999999999999995
>
> >>> inside/1000.
> 0.81499999999999995
>
> I haven't looked at the details enough yet, but I think this way you could
> test more quantiles of the distribution, to see whether the posterior
> distribution is roughly the same as the sampling distribution in the
> Monte Carlo.

I could test more quantiles, but I'm very distrustful of testing more
than one quantile per randomly-generated sample: they will be
correlated (if the 90% mark is too high, the 95% mark will almost
certainly be too high as well) and I don't know how to take that into
account. And running the test is currently so slow that I'm inclined to
spend my CPU time on a stricter test of a single quantile. Though
unfortunately, to increase the strictness I also need to improve the
sampling in phase and fraction.
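
The single-quantile check is then just a binomial bound on the
observed coverage; a sketch of what I mean (the 410-out-of-500
numbers are from your run above):

    from scipy import stats

    def coverage_ok(n_inside, n_trials, level=0.80, alpha=0.05):
        # two-sided binomial check: is the observed coverage consistent
        # with the nominal credible level?
        lo = stats.binom.ppf(alpha / 2.0, n_trials, level)
        hi = stats.binom.ppf(1.0 - alpha / 2.0, n_trials, level)
        return lo <= n_inside <= hi

    coverage_ok(410, 500)   # the M=500 run above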

> In each iteration of the Monte Carlo you get a full posterior distribution;
> after a large number of iterations you have a sampling distribution,
> and it should be possible to compare this distribution with the
> posterior distributions. I'm still not sure how.

I don't understand what you mean here. I do get a full posterior
distribution out of every simulation. But how would I combine these
different distributions, and what would the combined distribution
mean?

> Two questions about your algorithm:
>
> Isn't np.random.shuffle(r) redundant?
> I didn't see anywhere where the sequence of observations in r would matter.

It is technically redundant. But since the point of all this is that I
don't trust my code to be right, I want to make sure there's no way it
can "cheat" by taking advantage of the order. And in any case, the
slow part is my far-too-simple numerical integration scheme. I'm
pretty sure the phase integration, at least, could be done
analytically.

> Why do you subtract mx in the loglikelihood function?
>    mx = np.amax(lpdf)
>    p = np.exp(lpdf - mx)/np.average(np.exp(lpdf-mx))

This is to avoid overflows. I could just use logsumexp/logaddexp, but
that's not yet in numpy on any of the machines I regularly use.
Subtracting mx has no effect on the value, since the same constant is
subtracted from both the numerator and the denominator, but it ensures
that the largest argument passed to exp is exactly zero.
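
In isolation the trick is just the following (a sketch; the function
name is only for illustration):

    import numpy as np

    def normalized_weights(lpdf):
        # exponentiate log-density values without overflow: factor out
        # the maximum, which cancels between numerator and denominator
        mx = np.amax(lpdf)
        w = np.exp(lpdf - mx)       # largest argument to exp is exactly 0
        return w / np.average(w)    # == exp(lpdf) / average(exp(lpdf))

    # where logaddexp is available, the normalizer could instead be
    #   log_norm = np.logaddexp.reduce(lpdf) - np.log(len(lpdf))
    #   w = np.exp(lpdf - log_norm)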

>>>> I can even generate models and then data
>>>> sets that are drawn from the prior distribution, but what should I
>>>> expect from the code output on such a data set?
>>>
>>> If you ignore the Bayesian interpretation, then this is just a
>>> standard sampling problem: you draw prior parameters and observations,
>>> and the rest is just finding the conditional and marginal probabilities.
>>> I think the posterior odds ratio should converge in a large Monte Carlo
>>> to the true one, and the significance levels should correspond to the
>>> one that has been set for the test (5%).
>>> (In the simplest case of conjugate priors, you can just interpret the
>>> prior as a previous sample and you are back to a frequentist
>>> explanation.)
>>
>> This sounds like what I was trying for - draw a model according to the
>> priors, then generate a data set according to the model. I then get
>> some numbers out: the simplest is a probability that the model was
>> pulsed, but I can also get a credible interval or an estimated CDF for
>> the model parameters.  But I'm trying to figure out what test I should
>> apply to those values to see if they make sense.
>>
>> For a credible interval, I suppose I could take (say) a 95% credible
>> interval, then 95 times out of a hundred the model parameters I used
>> to generate the data set should be in the credible interval. And I
>> should be able to use the binomial distribution to put limits on how
>> close to 95% I should get in M trials. This seems to work, but I'm not
>> sure I understand why. The credible region is obtained from a
>> probability distribution for the model parameters, but I am turning
>> things around and testing the distribution of credible regions.
>
> If you ignore the Bayesian belief interpretation, then it's just a
> problem of Probability Theory, and you are just checking the
> small and large sample behavior of an estimator and a test,
> whether it has a Bayesian origin or not.

Indeed. But with frequentist tests, I have a clear statement of what
they're telling me that I can test against: "If you feed this test
pure noise you'll get a result this high with probability p". I
haven't figured out how to turn the p-value returned by this test into
something I can test against.
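
For a frequentist test the check would be exactly that statement: feed
it pure noise many times and verify that the p-values come out
uniform. A sketch (pulsation_test and make_pure_noise are stand-ins
for the actual test and for an unpulsed data generator, not code I
have):

    import numpy as np
    from scipy import stats

    def check_null_pvalues(pulsation_test, make_pure_noise, n_runs=200):
        # under the null (pure noise) the p-values should be uniform on
        # [0, 1]; a KS test quantifies the departure from uniformity
        pvals = np.array([pulsation_test(make_pure_noise())
                          for i in range(n_runs)])
        return stats.kstest(pvals, "uniform")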

>> In any case, that seems to work, so now I just need to figure out a
>> similar test for the probability of being pulsed.
>
> "probability of being pulsed"
> I'm not sure what test you have in mind.
> There are two interpretations:
> In your current example, fraction is the fraction of observations that
> are pulsed and fraction=0 is a zero probability event. So you cannot
> really test fraction==0 versus fraction >0.
>
> In the other interpretation you would have a prior probability (mass)
> that your star is a pulsar with fraction >0 or a non-pulsing unit
> with fraction=0.

This is what the code currently implements: I begin with a 50% chance
the signal is unpulsed and a 50% chance the signal is pulsed with some
fraction >= 0.
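
Schematically, drawing a model from that prior looks something like
this (a sketch only; uniform distributions for fraction and phase are
assumed here just for illustration):

    import numpy as np

    def draw_model(rng=np.random):
        # 50% chance of an unpulsed signal, 50% chance of a pulsed one
        # with some fraction >= 0 (uniform fraction and phase assumed
        # purely for this sketch)
        if rng.rand() < 0.5:
            return {"pulsed": False, "fraction": 0.0, "phase": 0.0}
        return {"pulsed": True, "fraction": rng.rand(), "phase": rng.rand()}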

> The probabilities in both cases would be similar, but the interpretation
> of the test would be different, and differ between frequentists and
> Bayesians.
>
> Overall your results look almost too "nice", with 8000 observations
> you get a very narrow posterior in the plot.

If you supply a fairly high pulsed fraction, it's indeed easy to tell
that it's pulsed with 8000 photons; the difficulty comes when you're
looking for a 10% pulsed fraction: 8000 photons with a 10% pulsed
fraction is much harder than 800 photons with a 100% pulsed fraction.
If I were really interested in the many-photons case I'd want to think
about a prior that made more sense for really small fractions. But I'm
keeping things simple for now.

Anne


