[SciPy-User] ancova with optimize.curve_fit

Tue Dec 7 01:10:04 EST 2010

On Mon, Dec 6, 2010 at 11:00 PM, Peter Tittmann <ptittmann at gmail.com> wrote:
> Gentlemen,
> I've decided to switch to the OLS method, thought I did get the NNLS method
> that Skipper proposed working. I was not prepared to spend more time trying
> to make sense of the resulting array for ancova, etc. (also could not figure
> out how to interpret the resulting coefficient array as I was expecting a 2d
> array representing the a and b coefficient values but it returned a 1d
> array). I have hopefully simple follow up questions:
> 1. Is there a method to define explicitly the function used in OLS? I know
> numpy.linalg.lstsq is the way OLS works but is there another function where
> I can define the form?
> 2. I'm still interested in interpreting the results of the NNLS method, so
> if either of you can suggest what the resulting arrays mean id be grateful.
> I've attached the output of NNLS
> warm regards,
> Peter
>
> Here is the working version of NNLS:
> def getDiam2(ht,*b):
>     return b[0] * ht[:,1]**b[1] + np.sum(b[2:]*ht[:,2:], axis=1)
> dt = np.genfromtxt('/home/peter/Desktop/db_out.csv', delimiter=",",
> names=True, dtype=None)
>
> indHtPlot = adt['height']
> depDbh = adt['dbh']
> plot_dummies, col_map = sm.tools.categorical(dt['plot], drop=True,
> dictnames=True)
>
> def nnlsDummies():
>     '''this function returns coefficients and covariance arrays'''
>     plot_dummies, col_map = sm.tools.categorical(indPlot, drop=True,
> dictnames=True)
>     X = np.column_stack((indHt, plot_dummies))
>     Y = depDbh
>     coefs, cov = curve_fit(getDiam2, X, Y, p0= [0.]*X.shape[1])
>     return coefs, cov

(can you post at the bottom instead of the top, that's the custom on
this mailing list)

getDiam is just a linear model in this case, the first coefficient
should be the effect/slope of indHt, the remaining are the
constants/intercept for each (level of) "plot". According to your
output there would be 301 or 302 unique values in your plot array.
np.unique(indPlot)

The hypothesis that there are no differences across plots mean that
all the coefficients (except the first) are the same. An f-test would
be the usual to check this.

If instead you want to check that the effect/coefficient of Ht is
independent of plot, then you should use the product
indHt[:,None]*plot_dummies    (all plot dummies, use drop=False)

If you already have statsmodels, then you could estimate the original
linear model that Skipper described, take y=np.log(depDbh) and x =
sm.add_constant(np.log(indHt)[:,None]*plot_dummies)

then you can estimate res = sm.OLS(y.x).fit()
res.params are the parameters

Then you can do an f_test, which depends on the version of statsmodels
that you have.

You can also do an f_test with the results from the non-linear
curve_fit. I guess the easiest will be to estimate the model with and
without dummies, and compare the residual sum of squares with
scipy.stats.f_anova (?).

Josef

>
> --
> Peter Tittmann
>
> On Monday, December 6, 2010 at 4:55 PM, josef.pktd at gmail.com wrote:
>
> On Mon, Dec 6, 2010 at 7:41 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>
> On Mon, Dec 6, 2010 at 7:31 PM, Peter Tittmann wrote:
>> thanks both of you,
>> Josef, the data that I sent is only the first 100 rows of about 1500,
>> there
>> should be sufficient sampling in each plot.
>> Skipper, I have attempted to deploy your suggestion for not linearizing
>> the
>> data. It seems to work. I'm a little confused at your modification if the
>> getDiam function and I wonder if you could help me understand. The form of
>> the equation that is being fit is:
>> Y= a*X^b
>> your version of the detDaim function:
>>
>> def getDiam(ht, *b):
>>    return ht[:,0]**b[0] + np.sum(b[1:]*ht[:,1:], axis=1)
>>
>> Im sorry if this is an obvious question but I don't understand how this
>> works as it seems that the "a" coefficient is missing.
>> Thanks again!
>
> Right.  I took out the 'a', because as I read it when I linearized (I
> might be misunderstanding ancova, I never recall the details), if you
> include 'a' and also all of the dummy variables for the plot, then you
> will have a the problem of multicollinearity.  You could also include
> 'a' and drop one of the plot dummies, but then 'a' is just your
> reference category that you dropped.  So now b[0] is the nonlinear
> effect of your main variable and b[1:] contains linear shift effects
> of all the plots.  Hmm, thinking about it some more, though I think
> you could include 'a' in the non-linear version above (call it b[0]
> and shift everything else over by one), because now 'a' would be the
> effect when the current b[0] is zero.  I was just unsure how you meant
> 'a' when you had a*ht**b and were trying to include in ht the plot
> variable dummies.
>
> As I understand it, the intention is to estimate equality of the slope
> coefficients, so the continuous variable is multiplied with the dummy
> variables. In this case, the constant should still be added. The
> normalization question is whether to include all dummy-cont.variable
> products and drop the continuous variable, or include the continuous
> variable and drop one of the dummy-cont levels.
>
> Unless there is a strong reason to avoid log-normality of errors, I
> would work (first) with the linear version.
>
> Josef
>
>
>
> Skipper
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>