[SciPy-dev] New maximum entropy and Monte Carlo packages

Wed Jan 18 06:02:10 EST 2006

Hi all,

I recently moved two new packages, maxent and montecarlo, from the
sandbox into the main SciPy tree.  I've now moved them back to the
sandbox pending further discussion.  I'll introduce them here and ask
for feedback on whether they should be included in the main tree.

The maxent package is for fitting maximum entropy models subject to
expectation constraints.  Maximum entropy models represent the 'least
biased' models subject to given constraints.  When the constraints are
on the expectations of functionals -- the usual formulation -- maximum
entropy models take the form of a generalized exponential family.  A
normal distribution, for example, is a maximum entropy distribution
subject to mean and variance constraints.

The maxent package contains one main module and one module with utility
functions.  Both are entirely in Python.  (I have now removed the F2Py
dependency.)  The main module supports fitting models on either small or
large sample spaces, where 'large' means continuous or otherwise too
large to iterate over.  Maxent models on 'small' sample spaces are
common in natural language processing; models on 'large' sample spaces
are useful for channel modelling in mobile communications, spectrum and
chirp analysis, and (I believe) fluid turbulence.  Some simple examples
are in the examples/ directory.  The simplest use is to define a list of
functions f, an array of desired expectations K, and a sample space, and
use the commands

>>> model = maxent.model(f, samplespace)
>>> model.fit(K)

You can then retrieve the fitted parameters directly or analyze the
model in other ways.

I've been developing the maxent algorithms and code for about 4 years. 
The code is very well commented and should be straightforward to maintain.

The montecarlo package currently does only one thing.  It generates
discrete variates from a given distribution.  It does this FAST.  On my
P4 it generates over 10^7 variates per second, even for a sample space
with 10^6 elements.  The algorithm is the compact 5-table lookup sampler
of Marsaglia.  The main module, called 'intsampler', is written in C. 
There is also a simple Python wrapper class around this called
'dictsampler' that provides a nicer interface, allowing sampling from a
distribution with arbitrary hashable objects
(e.g. strings) as labels instead of {0,1,2,...}.  dictsampler has
slightly more overhead than intsampler, but is also very fast (around
10^6 per second for me with a sample space of 10^6 elements labelled
with strings).  An example of using it to sample from this discrete
distribution:

    x       'a'       'b'       'c'
    p(x)    10/180    150/180   20/180

is:

>>> table = {'a':10, 'b':150, 'c':20}
>>> sampler = dictsampler(table)
>>> sampler.sample(10**4)
array([b, b, a, ..., b, b, c], dtype=object)

The montecarlo package is very small (and not nearly as impressive as
Christopher Fonnesbeck's PyMC package), but the functionality that is
there would be an efficient foundation for many discrete Monte Carlo
algorithms.

I'm aware of the build issue Travis Brady reported with MinGW not
defining lrand48().  I can't remember why I used this, but I'll adapt it
to use lrand() instead and report back.

Would these packages be useful?  Are there any objections to including them?

-- Ed