Probabilistic unit tests?

duncan smith buzzard at invalid.invalid
Fri Jan 11 13:05:05 EST 2013


On 11/01/13 01:59, Nick Mellor wrote:
> Hi,
>
> I've got a unit test that will usually succeed but sometimes fails. An occasional failure is expected and fine. It's failing all the time that I want to test for.
>
> What I want to test is "on average, there are the same number of males and females in a sample, give or take 2%."
>
> Here's the unit test code:
> import unittest
> from collections import Counter
>
> sex_count = Counter()
> for contact in range(self.binary_check_sample_size):
>      p = get_record_as_dict()
>      sex_count[p['Sex']] += 1
> self.assertAlmostEqual(sex_count['male'],
>                         sex_count['female'],
>                         delta=self.binary_check_sample_size * 2.0 / 100.0)
>
> My question is: how would you run an identical test 5 times and pass the group *as a whole* if only one or two iterations passed the test? Something like:
>
>      for n in range(5):
>          # self.assertAlmostEqual(...)
>          # if test passed: break
>      else:
>          self.fail()
>
> (except that would create 5+1 tests as written!)
>
> Thanks for any thoughts,
>
> Best wishes,
>
> Nick
>

The appropriateness of "give or take 2%" will depend on sample size. 
e.g. if the underlying proportion of males is exactly 0.5 but your 
sample is small, the difference between the male and female counts 
will exceed 2% of the sample size most of the time, so the test will 
usually fail even though the code is behaving correctly.
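
As a rough illustration (just a sketch; random.random() < 0.5 stands 
in for the code under test), you can simulate how often a 2% 
tolerance fails when the underlying proportion really is 0.5:

import random

def failure_rate(sample_size, trials=1000):
    # How often does |males - females| exceed 2% of the sample
    # when the underlying proportion is exactly 0.5?
    failures = 0
    for _ in range(trials):
        males = sum(random.random() < 0.5 for _ in range(sample_size))
        females = sample_size - males
        if abs(males - females) > sample_size * 2.0 / 100.0:
            failures += 1
    return failures / float(trials)

failure_rate(100) comes out around 0.75, and failure_rate(10000) 
around 0.05. The difference between the two counts has a standard 
deviation of about sqrt(n), so a 2% margin is only about two standard 
deviations once the sample size reaches 10000.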

What you could do is perform a statistical test. Generally this involves 
generating a p-value and rejecting the null hypothesis if the p-value is 
below some chosen threshold (Type I error rate), often taken to be 0.05. 
Here the null hypothesis would be that the underlying proportion of 
males is 0.5.
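
For a binomial proportion you can get a p-value from the normal 
approximation with nothing beyond the standard library. A minimal 
sketch (binom_p_value is my own name for it, not an existing API; it 
assumes n is large enough for the approximation to be reasonable):

import math

def binom_p_value(k, n, p0=0.5):
    # Two-sided p-value for H0: underlying proportion == p0, using
    # the normal approximation to the binomial.
    mean = n * p0
    sd = math.sqrt(n * p0 * (1.0 - p0))
    z = (k - mean) / sd
    # 2 * P(Z > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2.0))

e.g. binom_p_value(530, 1000) is about 0.058, so 530 males out of 
1000 would fall just short of rejection at the 0.05 level.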

A statistical test will incorrectly reject a true null in a proportion 
of cases equal to the chosen Type I error rate. A test will also fail to 
reject false nulls a certain proportion of the time (the Type II error 
rate). The Type II error rate can be reduced by using larger samples. I 
prefer to generate several samples and test whether the proportion of 
failures is about equal to the error rate.
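
That check might look something like this (a sketch reusing 
binom_p_value from above, with a fair generator again standing in 
for the code under test):

import random

ALPHA = 0.05
NUM_TESTS = 400
SAMPLE_SIZE = 100

rejections = 0
for _ in range(NUM_TESTS):
    males = sum(random.random() < 0.5 for _ in range(SAMPLE_SIZE))
    if binom_p_value(males, SAMPLE_SIZE) < ALPHA:
        rejections += 1

# With a true null the rejection rate should be close to ALPHA,
# i.e. roughly 20 rejections out of 400 here.
print(rejections / float(NUM_TESTS))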

The above implies that, when the null hypothesis is true, p-values 
follow a uniform distribution on [0, 1]. So alternatively you could 
generate many samples / p-values and test the p-values for 
uniformity. That is what I generally do:


# (generate_data, stat_test and check_uniformity are placeholders)
p_values = []
for _ in range(num_tests):
    values = generate_data()            # data from the code under test
    p_values.append(stat_test(values))  # p-value for that data set
check_uniformity(p_values)              # e.g. against U(0, 1)
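
Fleshed out, that might look like the following (a sketch assuming 
scipy is available, with binom_p_value from above and a fair 
generator standing in for the code under test):

import random
from scipy import stats

NUM_TESTS = 200
SAMPLE_SIZE = 100

p_values = []
for _ in range(NUM_TESTS):
    males = sum(random.random() < 0.5 for _ in range(SAMPLE_SIZE))
    p_values.append(binom_p_value(males, SAMPLE_SIZE))

# Kolmogorov-Smirnov test of the p-values against U(0, 1).  (The
# binomial is discrete, so the p-values are only approximately
# uniform even under the null.)
ks_statistic, ks_p = stats.kstest(p_values, 'uniform')
print(ks_statistic, ks_p)   # a small ks_p suggests a problem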


The result is still a test that will fail a given proportion of the 
time. You just have to live with that. Run your test suite several times 
and check that no one test is "failing" too regularly (more often than 
the chosen Type I error rate for the test of uniformity). My experience 
is that any issues generally result in the test of uniformity being 
consistently rejected (which is why I do that rather than just 
performing a single test on a single generated data set).

In your case you're testing a binomial proportion, and as long as 
you're generating enough data (you need to take into account any test 
assumptions / approximations) the observed proportions will be 
approximately normally distributed. Samples of e.g. 100 would be fine. 
P-values can be generated from the appropriate normal 
(http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval), 
and uniformity can be tested using e.g. the Kolmogorov-Smirnov or 
Anderson-Darling test 
(http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm).

I'd have thought that something like this also exists somewhere. How do 
people usually test e.g. functions that generate random variates, or 
other cases where deterministic tests don't cut it?

Duncan


