[SciPy-dev] GLMs ?

Skipper Seabold jsseabold at gmail.com
Sat Aug 15 10:27:25 EDT 2009


On Sat, Aug 15, 2009 at 9:19 AM, Bruce Southey<bsouthey at gmail.com> wrote:
> On Sat, Aug 15, 2009 at 8:12 AM, <josef.pktd at gmail.com> wrote:
>> On Sat, Aug 15, 2009 at 7:35 AM, <josef.pktd at gmail.com> wrote:
>>> On Sat, Aug 15, 2009 at 3:32 AM, Pierre GM<pgmdevlist at gmail.com> wrote:
>>>>
>>>> On Aug 15, 2009, at 3:00 AM, David Warde-Farley wrote:
>>>>
>>>>>
>>>>> On 14-Aug-09, at 7:29 PM, josef.pktd at gmail.com wrote:
>>>>>
>>>>>>> Fab'.
>>>>>>> FYI, I need to fit Tweedie distributions to precipitation series. I
>>>>>>> have already coded the distributions in the scipy standard, and
>>>>>>> now I
>>>>>>> need to estimate the parameters...
>>>>>>> Thanks again
>>>>>
>>>>> As I understand it, the Tweedie distributions are a further
>>>>> generalization of the exponential family.
>>>>
>>>> Indeed.
>>>>
>>>>> Are you saying that your
>>>>> parametric assumption is that they are Tweedie but not any of the
>>>>> standard ones like Gaussian, Poisson, Gamma?
>>>>
>>>> Yes, something intermediate between Poisson and Gamma, with a variance
>>>> proportional to the mean to a power 1<=p<=2.
>>>>
>>>>>> Are you trying to estimate parameters of the distribution themselves,
>>>>>> or parameters of the distribution as function of some explanatory
>>>>>> variables? In the first case, GLM won't be of much help.
>>>>>
>>>>> Is it that you have samples of a (nonstandard) Tweedie random variable
>>>>> that you want to regress on explanatory variables?
>>>>> You can probably do it by gradient descent but I don't foresee it
>>>>> being pretty and probably not even convex. Either way, a GLM package
>>>>> probably won't  help.
>>>>
>>>> I'm not sure yet whether GLMs are the way to go to my particular
>>>> problem. I'm trying to reproduce an approach to model precipitation
>>>> patterns (keeping track of both the number and intensities of rainfall
>>>> events) described in several papers. I know that at term, I'll have to
>>>> introduce extra variables and then GLMs will be the way to go. I just
>>>> wanted to check what algorithms were already available.
>>>> Thanks a lot for your comments.
>>>
>>> Using models.GLM could be as easy as adding a new distribution to the
>>> family. The main algorithm is (supposed to be) independent of the
>>> distribution, and all distribution specific code is supposed to be in
>>> family.
>>>
>>> If Tweedie is like Poisson and Gamma, mainly with a different variance
>>> function, then I think it *should* work with very little work.
>>>
>>> If you try this, then this would be a good check for how general our
>>> implementation is, and whether there are still some hidden,
>>> distribution specific assumptions left.
>>>
>>> And it will be good if we soon have more eyes on the models code,
>>> because I don't think we have settled on a good API yet.
>>>
>>> Josef
>>>
>>
>> I should have done a bit of homework first.
>>
>> The tweedie family of distributions looks very interesting, and it
>> should fit in both glm and maximum likelihood framework. R/S has
>> tweedie in GLM and ML in fbasics.
>>
>> So I would be very interested in seeing it both in models.glm and in
>> scipy.stats.distributions. However, in the short term, I see two
>> potential issues
>>
>> Wikipedia: "Apart from the four special cases identified above, their
>> probability density function have no closed form. However, software is
>> available that enables the accurate computation of the Tweedie
>> densities (and probability distribution functions)"
>>
>> Currently the models code is in pure python, which makes distribution
>> as a standalone package much easier, until the dust has settled, and
>> models is reintegrated into scipy. Do you have the numerical
>> calculations in python or compile, fortran,C? I didn't find tweedie in
>> hydroclimpy.
>>
>> Wikipedia: "For 1 < p < 2, the distribution is continuous on the
>> positive reals, plus an added mass (exact zero) at Y = 0"
>>
>> The generic framework for the distributions in stats.distributions
>> doesn't handle, currently, distributions that have continuous and
>> discrete support (masspoints). In some cases, this can be extended by
>> delegation, but we could think about to handle the mixed case.
>>
>> Josef
>>
>
> Hi,
> Technically generalized linear models only apply to a exponential
> family of distributions. However, it does work for some other like
> negative binomial so you have quazi-likelihoods. Typically you require
> the link and inverse link functions to do it and nonlinear modeling or
> iteratively reweighted least squares to solve. See Wikipedia:
> http://en.wikipedia.org/wiki/Generalized_linear_model
>

This is indeed the case.

It is my understanding that the negative binomial distribution is
thought of as a full member of the exponential family even though it
is a two-parameter exponential (since the ancillary parameter is
assumed to be nonstochastic for our code at the moment) and our
approach is then not quasi-likelihood.  I wanted to extend the GLMs to
allow for estimation of the variance function, etc. in a
quasi-likelihood framework, but I didn't have the time.

Our approach uses iteratively reweighted least squares, which is
really only ~5 lines of code if you have access to weighted least
squares.

Having a quick look at the Tweedie distribution (I came across them in
the literature as Poisson-Gamma mixture models, which includes the
negative binomial), it shouldn't be any problem to include support for
them in the current framework under assumptions on the power in the
variance, if this will suit your needs.  Once I've finished the
documentation it should be pretty clear on how to add distributions.
It took almost no time to add support for the inverse gaussian and
negative binomial distributions.  As mentioned above you only need to
define the variance function, the link function, its inverse and its
derivative (could be done numerically).

Skipper



More information about the SciPy-Dev mailing list