[SciPy-dev] arithmetic on distributions

Perry, Alexander (GE Infrastructure) Alex.Perry at ge.com
Tue Aug 9 14:55:45 EDT 2005


Has anyone implemented, or considered implementing, the arithmetic methods for rv_continuous?

First suggestion.  When calling an object to freeze it, if the loc is already a distribution, we should merge its variance into the variance of the newly frozen distribution (a sketch follows the workaround below):
> ten = norm ( 10, 3 )
> print norm ( ten, 4 ).stats()
(10.0, 25.0)
Currently, an error is triggered.  Because of the difficulty of merging the standard deviation of the old value into the new desired standard deviation, I suspect that most people who encounter the error will take the lazy way out and simply discard the variance attached to the value:
> try: value=value.stats()[0]
> except: pass # the above is wrong!
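A minimal sketch of the intended merge, using a standalone helper rather than patching norm itself (freeze_norm is a name invented here, and the sketch assumes a normal result):

import numpy as np
from scipy.stats import norm

def freeze_norm(loc, scale):
    # If loc is itself a frozen distribution, fold its variance into
    # the variance of the new distribution; otherwise freeze as usual.
    try:
        mean, var = loc.stats()
    except AttributeError:
        return norm(loc, scale)
    return norm(mean, np.sqrt(var + scale**2))

ten = norm(10, 3)
print(freeze_norm(ten, 4).stats())   # -> (array(10.0), array(25.0))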

Second suggestion.  When a binary operator such as __add__() is called on an object that derives from rv_continuous, an error is currently raised.  There should be a function that checks whether the other operand is a plain number and, if it is, adjusts the object's own loc and scale.  If both operands of the binary operator are instances of rv_continuous, the function returns an instance of a new subclass, rv_composite, that simply stores those two operands and the desired operator.  The new subclass passes through all operations that can be decomposed directly (such as rvs and stats).  This allows a real-world distribution to be assembled as a single object, and that object directly supports both linear approximation and Monte Carlo methods (a sketch of rv_composite follows the example below):
> fifteen = norm ( 10, 3 ) + uniform( 3, 4 )
> # Linearized estimate
> print "%7.5f" % norm ( fifteen, 0 ).cdf ( 20 )
0.94008
> # Monte carlo estimate
> print sum ( [ x<20 for x in fifteen.rvs(100000) ])
94125
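Something along these lines would cover the pass-through part; the structure is only an assumption, and the stats combination shown is valid only for addition of independent distributions:

import operator
from scipy.stats import norm, uniform

class rv_composite(object):
    def __init__(self, left, right, op=operator.add):
        self.left, self.right, self.op = left, right, op

    def rvs(self, size=1):
        # Sampling decomposes directly: draw from each operand, combine.
        return self.op(self.left.rvs(size), self.right.rvs(size))

    def stats(self):
        # For addition of independent distributions, means and
        # variances both add.
        lm, lv = self.left.stats()
        rm, rv = self.right.stats()
        return lm + rm, lv + rv

fifteen = rv_composite(norm(10, 3), uniform(3, 4))
print(fifteen.stats())   # ~ (15.0, 10.33)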

The composite distribution instance implements the remainder of the standard "rv" features through integration and differentiation of the underlying distributions.  This is no different from what users would have to do themselves, but it saves the effort of calling those scipy features explicitly (the underlying convolution is sketched after the example below):
> # Accurate integrated value
> print "%7.5f" % fifteen.cdf(20)
0.94000
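The integration in question is the usual convolution: P(X+Y < t) is the integral of F_X(t-y)*f_Y(y) over y.  A sketch with scipy.integrate.quad (composite_cdf is a hypothetical helper):

from scipy import integrate
from scipy.stats import norm, uniform

def composite_cdf(x, y, t):
    # Integrate F_X(t - v) * f_Y(v) over the effective support of Y.
    lo, hi = y.ppf(1e-12), y.ppf(1 - 1e-12)
    val, err = integrate.quad(lambda v: x.cdf(t - v) * y.pdf(v), lo, hi)
    return val

print("%7.5f" % composite_cdf(norm(10, 3), uniform(3, 4), 20))   # 0.94000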

Third suggestion.  The __cmp__ method should exist and can simply compare the means, since its Python definition is to return an integer whose sign flow of control can branch on.  The more accurate comparison functions (such as __lt__) should exist and return probabilities, because Python allows them to return an arbitrary data result with no flow-control interpretation.
> print "%7.5f" % ( fifteen < 20 )
0.94000

Unlike the cdf, which assumes that its argument carries no uncertainty of its own, the comparison functions have to allow for the fact that both sides could be distributions.  Consequently:
> three = norm(3,0.3)
> four  = norm(4,0.4)
> print three.cdf(4)
0.9996
> print four.sf(3)
0.9938
> print three < four
0.9772
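That last figure can be checked by hand, since the difference of independent normals is itself normal:

from scipy.stats import norm

# three - four is normal with mean 3 - 4 = -1 and
# sd = sqrt(0.3**2 + 0.4**2) = 0.5, so P(three < four) = P(three - four < 0).
print("%7.5f" % norm(-1, 0.5).cdf(0))   # 0.97725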

# Proposed additions to rv_continuous (Python 2 sketch, hence cmp):
class rv_continuous:
    def __cmp__(self, other):
        # Reduce a distribution operand to its mean before comparing.
        try:
            other = other.stats()[0]
        except AttributeError:
            pass
        return cmp(self.stats()[0], other)

    def __lt__(self, other):
        # The probability that self falls below other.
        return (self - other).cdf(0)

Comments?
	Alex.

PS.  I don't use discrete distributions nearly as much as continuous ones, so I'm not going to comment on whether it's a good idea to extend the above concept to rv_discrete.  It would be possible to return any comparison result as a binomial distribution on a single trial, making it easier to pursue statistical conclusions on uncertain data, but I've no idea whether that would be helpful.  Similarly, just as "norm+norm" can be optimized, so can "binomial**int" and similar cases (see the sketch below).
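For the "norm+norm" case the optimization is just the closed form, collapsing to a single frozen norm instead of a composite; a sketch (add_norms is a name invented here):

from math import sqrt
from scipy.stats import norm

def add_norms(a, b):
    # The sum of independent normals is normal: means and variances add.
    (ma, va), (mb, vb) = a.stats(), b.stats()
    return norm(ma + mb, sqrt(va + vb))

print(add_norms(norm(10, 3), norm(5, 4)).stats())   # -> (15.0, 25.0)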



