[SciPy-User] ancova with optimize.curve_fit

Sebastian Haase seb.haase at gmail.com
Tue Dec 7 13:00:11 EST 2010


On Tue, Dec 7, 2010 at 5:35 PM,  <josef.pktd at gmail.com> wrote:
> On Tue, Dec 7, 2010 at 4:58 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>> On 12/06/2010 10:00 PM, Peter Tittmann wrote:
>>
>> Gentlemen,
>> I've decided to switch to the OLS method, thought I did get the NNLS method
>> that Skipper proposed working. I was not prepared to spend more time trying
>> to make sense of the resulting array for ancova, etc. (also could not figure
>> out how to interpret the resulting coefficient array as I was expecting a 2d
>> array representing the a and b coefficient values but it returned a 1d
>> array). I have hopefully simple follow up questions:
>> 1. Is there a method to define explicitly the function used in OLS? I know
>> numpy.linalg.lstsq is the way OLS works but is there another function where
>> I can define the form?
>> 2. I'm still interested in interpreting the results of the NNLS method, so
>> if either of you can suggest what the resulting arrays mean id be grateful.
>> I've attached the output of NNLS
>> warm regards,
>> Peter
>>
>> Here is the working version of NNLS:
>> def getDiam2(ht,*b):
>>     return b[0] * ht[:,1]**b[1] + np.sum(b[2:]*ht[:,2:], axis=1)
>> dt = np.genfromtxt('/home/peter/Desktop/db_out.csv', delimiter=",",
>> names=True, dtype=None)
>>
>> indHtPlot = adt['height']
>> depDbh = adt['dbh']
>> plot_dummies, col_map = sm.tools.categorical(dt['plot], drop=True,
>> dictnames=True)
>>
>> def nnlsDummies():
>>     '''this function returns coefficients and covariance arrays'''
>>     plot_dummies, col_map = sm.tools.categorical(indPlot, drop=True,
>> dictnames=True)
>>     X = np.column_stack((indHt, plot_dummies))
>>     Y = depDbh
>>     coefs, cov = curve_fit(getDiam2, X, Y, p0= [0.]*X.shape[1])
>>     return coefs, cov
>>
>> --
>> Peter Tittmann
>>
>> On Monday, December 6, 2010 at 4:55 PM, josef.pktd at gmail.com wrote:
>>
>> On Mon, Dec 6, 2010 at 7:41 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>
>> On Mon, Dec 6, 2010 at 7:31 PM, Peter Tittmann wrote:
>>> thanks both of you,
>>> Josef, the data that I sent is only the first 100 rows of about 1500,
>>> there
>>> should be sufficient sampling in each plot.
>>> Skipper, I have attempted to deploy your suggestion for not linearizing
>>> the
>>> data. It seems to work. I'm a little confused at your modification if the
>>> getDiam function and I wonder if you could help me understand. The form of
>>> the equation that is being fit is:
>>> Y= a*X^b
>>> your version of the detDaim function:
>>>
>>> def getDiam(ht, *b):
>>>    return ht[:,0]**b[0] + np.sum(b[1:]*ht[:,1:], axis=1)
>>>
>>> Im sorry if this is an obvious question but I don't understand how this
>>> works as it seems that the "a" coefficient is missing.
>>> Thanks again!
>>
>> Right.  I took out the 'a', because as I read it when I linearized (I
>> might be misunderstanding ancova, I never recall the details), if you
>> include 'a' and also all of the dummy variables for the plot, then you
>> will have a the problem of multicollinearity.  You could also include
>> 'a' and drop one of the plot dummies, but then 'a' is just your
>> reference category that you dropped.  So now b[0] is the nonlinear
>> effect of your main variable and b[1:] contains linear shift effects
>> of all the plots.  Hmm, thinking about it some more, though I think
>> you could include 'a' in the non-linear version above (call it b[0]
>> and shift everything else over by one), because now 'a' would be the
>> effect when the current b[0] is zero.  I was just unsure how you meant
>> 'a' when you had a*ht**b and were trying to include in ht the plot
>> variable dummies.
>>
>> As I understand it, the intention is to estimate equality of the slope
>> coefficients, so the continuous variable is multiplied with the dummy
>> variables. In this case, the constant should still be added. The
>> normalization question is whether to include all dummy-cont.variable
>> products and drop the continuous variable, or include the continuous
>> variable and drop one of the dummy-cont levels.
>>
>> Unless there is a strong reason to avoid log-normality of errors, I
>> would work (first) with the linear version.
>>
>> Josef
>>
>>
>>
>> Skipper
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>> I do think this is starting to be an off-list discussion because this is
>> really about statistics and not about numpy/scipy (you can contact me
>> off-list if you want).
>
> If it's too much stats, we can continue on the statsmodels list. Last
> time there was a similar question, I programmed most of the tests for
> the linear case, and it should soon be possible to do it for the
> non-linear case also.
> The target is to be able to do this in 5 (or so) lines of code for the
> linear case and maybe 10 lines for the non-linear case.
>
> Josef
>
>>
>> I am not sure what all the variables are so please excuse me but I presume
>> you want to model dbh as a function of height, plot and species. Following
>> usual biostatistics interpretation, 'plot' is probably treated as random
>> effect but you probably have to use R/SAS etc for that for both linear and
>> nonlinear models or some spatial models.
>>
>> Really you need to determine whether or not a nonlinear model is required.
>> With the few data points you provided, I only see a linear relationship
>> between dbh and height with some outliers and perhaps some heterogeneity of
>> variance. Often doing a simple polynomial/spline can help to see if there is
>> any evidence for a nonlinear relationship in the full data - a linear model
>> or polynomial with the data provided does not suggest a nonlinear model.
>> Obviously a linear model is easier to fit and interpret especially if you
>> create the design matrix as estimable functions (which is rather trivial
>> once you understand using dummy variables).
>>
>> The most general nonlinear/multilevel model proposed is of the form:
>> dbh= C + A*height^B
>> Obviously if B=1 then it is a linear model and the parameters A, B and C can
>> be modeled with a linear function of intercept, plot and species. Although,
>> if 'plot' is what I think it is then you probably would not model the
>> parameters A and B with it.
>>
>> Without C you are forcing the curve through zero which is biological
>> feasible if you expect dbh=0 when height is zero. However, dbh can be zero
>> if height is not zero just due to the model itself or what dbh actually is
>> (it may take a minimum height before dbh is greater than zero). With the
>> data you provided, there are noticeable differences between species for dbh
>> and height so you probably need C in your model.
>>
>> For this general model you probably should just fit the curve for each
>> species alone but I would use a general stats package to do this. This will
>> give you a good starting point to know how well the curve fits each species
>> as well as the similarity of parameters and residual variation. Getting
>> convergence with a model that has B varying across species may be rather
>> hard so I would suggest modeling A and C first.
>>
>> Bruce
>>
I have never heard of a "statsmodels lists" -- where and what is that !?
Thanks,
- Sebastian Haase



More information about the SciPy-User mailing list