[SciPy-user] scipy.stats rv objects from data

Mon Apr 28 05:02:31 EDT 2008

rv_discrete: most (if not all) of the scipy.stats functions +
numpy.random cannot handle zeros as _inputs_ (don't know whether this is
related to getting zero's _out_, but it might be). The zero problem is,
I am told, due to the underlying c code, not python. A quick workaround
is to substitute any zeros for a small number like 1e-16

On Sun, 2008-04-27 at 16:29 -0700, Erik Tollerud wrote:
> I'm finding the scipy.stats documentation somewhat difficult to
> follow, so maybe the answer to this question is in there... I can't
> really find it, though.

> What I have is a sequence of numbers X_i . Two things I'd like to be
> able to do with this:
> 1. Create a discrete probability distribution (class rv_discrete) from
> this data so as to use the utility functions that take rv_discrete
> objects.
> The rv_discrete documentation suggests should be easy.  I did the following
> >>>ddist=rv_discrete(values=(x,[1/len(x) for i in x]),name='test')
> >>>ddist.pmf(50)
> array(0.0)

There are at least 2 ways of using rv_discrete

e.g. 2 ways to calculate the next element of a simple Markov Chain with
x(n+1)=Norm(0.5 x(n),1)

from scipy.stats import rv_discrete
from numpy.random import multinomial

    x = 3
    n1 = stats.rv_continuous.rvs( stats.norm, 0.5*x, 1.0 )[0] 
    print n1 
    n2 = stats.rv_discrete.rvs( stats.rv_discrete( name='sample',
values=([0,1,2],[3/10.,5/10.,2/10.])), 0.5*x, 1.0 )[0] 
    print n2 
    print
    sample = stats.rv_discrete( name='sample',
values=([0,1,2],[3/10.,5/10.,2/10.]) ).rvs( size=10 ) 
    print sample

The multinomial distribution from numpy.random is somewhat faster (40
times or so) but has a different idiom:

SIZE = 100000
VALUES = [0,1,2,3,4,5,6,7]
PROBS = [1/8.,1/8.,1/8.,1/8.,1/8.,1/8.,1/8.,1/8.]

The idiom for rv_discrete is 
	rv_discrete( name='sample', values=(VALUES,PROBS) ) 

The idiom for numpy.multinomial is different; if memory serves, you get
frequencies as output instead of the actual values
	multinomial( SIZE, PROBS )

>>> from numpy.random import multinomial
>>> multinomial(100,[ 0.2, 0.4, 0.1, 0.3 ])
array([12, 44, 10, 34])
>>> multinomial( 100, [0.2, 0.0, 0.8, 0.0] ) <-- don't do this
...
>>> multinomial( 100, [0.2, 1e-16, 0.8, 1e-16] ) <-- or this
>>> multinomial( 100, [0.2-1e-16, 1e-16, 0.8-1e-16, 1e-16] ) <-- ok	
array([21,  0, 79,  0])

the last one is ok since the probability adds up to 1... painful, but it
works

> Any value I try to get of the pmf seems to be 0.  Do I have to
> explicitly subclass rv_discrete with my data and a _pmf method or
> something? This seems like a very natural thing to want to do, and
> hence it seems odd to not have some helper like
> make_dist(x,name='whatever') .  I can take a shot at creating such a
> function, but I don't want to do so if one exists.
> 
> 2. Create a continuous probability distribution from something like
> spline fitting or simple linear interpolation of a the data in X_i.
> Does this require explict subclassing, or is there a straightforward
> way to do it that's builtin?  I'm not sure if this step is strictly
> necessary - what I really want to do is be able to draw from the
> discrete distribution in 1 just by sampling the cdf... maybe this is
> how it's supposed to work with the discrete distribution, but when I
> tried to sample it using ddist.rvs, I would always get the input
> values I specified rather random values sampled from the cdf.

Continuous v's discrete: i found this in ./stats/scstats.py

from scipy import stats, r_ 	
from pylab import show, plot 	
import copy

# SOURCE: ./stats/scstats.py

SPREAD = 10

class cdf( object ):
    """ Baseclass for the task of determining a sequence of numbers {vi}
which is distributed as a random variable X
    """ 
    def integerDensityFunction( self ):
    	"""
 	Outputs an integer density function: xs (ints) and ys (probabilities)
which are the 		correspondence between the whole numbers on the x axis
to the probabilities on the y axis, according to a normal distribution.
	"""
    	opt = []
    	for i in r_[-SPREAD:SPREAD:100j]: # 2-tailed test (?)
            opt.append(( i, stats.norm.cdf(i) )) # ( int, P(int) ) 
    	return zip(*opt) 			# [ (int...), (P...) ]

    def display( self ):
    	xs, ys = self.integerDensityFunction()  
	plot( xs, ys )
	show()   	

if __name__=='__main__':
    d = cdf()
    d.display()	

Continuous: i can only suggest using rv_continuous

	stats.rv_continuous.rvs( stats.norm, 0.5*x, 1.0 ).whatever

	.rvs( shape, loc, scale ) is the random variates 
	.pdf( x, shape, loc, scale ) is the probability density function which,
i think, is or should be genuinely continuous

> I'm on scipy 0.6.0 and numpy 1.0.4
> _______________________________________________
> SciPy-user mailing list
> SciPy-user at scipy.org
> http://projects.scipy.org/mailman/listinfo/scipy-user