[SciPy-User] the skellam distribution

Mon Nov 9 13:16:51 EST 2009

 9/11/09 @ 09:40 (-0500), thus spake josef.pktd at gmail.com:
> Thanks, I think the distribution of the difference of two poisson
> distributed random variables could be useful.
> 
> Would you please open an enhancement ticket for this at
> http://projects.scipy.org/scipy/report/1

Okay, I will.

> I had only a brief look at it so far, I had never looked at the
> Skellam distribution before, and just read a few references.
> 
> The "if x < 0 .. else ..." will have to be replace with a
> "numpy.where" assignment, since the methods are supposed to work with
> arrays of x (as far as I remember)

Done.

> _rvs could be implemented directly instead of generically (I don't
> find the reference, where I saw it, right now).

I suppose it could be done, but I can't figure out how.
As far as I understand, you can't derive the poisson parameters
from the skellam parameters alone, you need the correlation
coefficient (between the two poisson rv) too, is that right?

> Documentation will be necessary,  a brief description in the
> (currently) extradocs, and a listing of the properties for the
> description of the distributions currently in the stats tutorial.
> 
> I have some background questions, which address the limitation of the
> implementation (but are not really necessary for inclusion into
> scipy).
> 
> The description in R mentions several implementation of Skellam. Do
> you have a rough idea what the range of parameters are for which the
> implementation using ncx produces good results? Do you know if any
> other special functions would produce good results over a larger
> range, e.g. using Bessel function?

No, sorry, I have no idea. The R paper says that in R the Bessel
function is more accurate, but I don't think this means that the
Bessel function implementation is more accurate in general.
I compared results using the Bessel function in scipy and using
the ncx2 implementation, and the differences were minimal, although
I admit my testing wasn't extensive. Another problem is that
I haven't got a table of values for this particular distribution,
so in case results differ I can't tell which one is more accurate.

> Wikipedia, http://en.wikipedia.org/wiki/Skellam_distribution , also
> mentions (but doesn't describe) the case of Skellam distribution with
> correlated Poisson distributions. Do you know what the difference to
> your implementation would be?

As far as I know, it makes no difference if the two Poisson
variates are correlated or not. The Skellam parameters are
defined as

mu1 = lam1 - rho * sqrt(lam1*lam2)
mu2 = lam2 - rho * sqrt(lam1*lam2)

where lam1 and lam2 are the Poisson means and rho is the
correlation coefficient. So, in case there is correlation it
is implicit in the parameters mu1 and mu2, and it doesn't
make any difference in terms of calculating the pmf or cdf.
Again, I am no statician or mathematician.

> Tests for a new distribution will be picked up by the generic tests,
> but it would be useful to have some extra tests for extreme/uncommon
> parameter ranges. Do you have any comparisons with R, since you
> already looked it?

I have this visual test, although it's usefulness is
questionable. It shows the deviation between observed and
expected frequencies for a Skellam random variable. Ideally
errors should be random and centered around 0. To me
it looks like errors tend to increase around the mean
and are smaller along the tails, but I don't know if this
means something or not.

import numpy
import scipy.stats.distributions

poisson = scipy.stats.distributions.poisson
ncx2 = scipy.stats.distributions.ncx2

# Skellam distribution

class skellam_gen(scipy.stats.distributions.rv_discrete):
    def _pmf(self, x, mu1, mu2):
        px = numpy.where(x < 0, ncx2.pdf(2*mu2, 2*(1-x), 2*mu1)*2,
                         ncx2.pdf(2*mu1, 2*(x+1), 2*mu2)*2)
        return px
    def _cdf(self, x, mu1, mu2):
        x = numpy.floor(x)
        px = numpy.where(x < 0, ncx2.cdf(2*mu2, -2*x, 2*mu1),
                         1-ncx2.cdf(2*mu1, 2*(x+1), 2*mu2))
        return px
    def _stats(self, mu1, mu2):
        mean = mu1 - mu2
        var = mu1 + mu2
        g1 = (mean) / numpy.sqrt((var)**3)
        g2 = 1 / var
        return mean, var, g1, g2
skellam = skellam_gen(a=-numpy.inf, name="skellam", longname='A Skellam',
                      shapes="mu1,mu2", extradoc="")

if __name__ == '__main__':

    lam1 = 3.4
    lam2 = 6.1
    n = 5000

    poisson_var1 = numpy.random.poisson(lam1, n)
    poisson_var2 = numpy.random.poisson(lam2, n)
    skellam_var = poisson_var1-poisson_var2

    low = min(skellam_var)
    high = max(skellam_var)
    obs_freq = numpy.histogram(
        skellam_var, numpy.arange(low, high+2))[0] / float(n)

    rho = numpy.corrcoef(poisson_var1, poisson_var2)[1,0]
    mu1 = lam1 - rho * numpy.sqrt(lam1*lam2)
    mu2 = lam2 - rho * numpy.sqrt(lam1*lam2)
    exp_freq = skellam.pmf(numpy.arange(low, high+1), mu1, mu2)
    print obs_freq-exp_freq

    # plot
    import matplotlib.pyplot as plt
    plt.figure().add_subplot(1,1,1).plot(obs_freq-exp_freq)
    plt.show()

Regards.
-- 
Ernest