[SciPy-User] Generalized least square on large dataset

Sat Mar 10 08:45:11 EST 2012

On 09.03.2012 21:13, josef.pktd at gmail.com wrote:
> I think Sturla has a point in that both count and length are positive. 
> It doesn't look like it's relevant for length, but in the counts there 
> is a bunching just above zero, this creates either a non-linearity or 
> requires another distribution log-normal (?) or Poisson (without 
> zeros, or loc=1)? Josef

You can see that the dependent variable is a count, with most values 
below 10. So I maintain that the appropriate model is Poisson regression.

That is,

    COX_count ~ Poisson(lambda)

with

    log(lambda) = b0 + b1 * genome_length

Or if there are N groups of bacteria,

    log(lambda) = b[0] + b[1] * genome_length
          + np.dot(b[2:N+1], group[0:N-1])

with N-1 dummy indicator variables in the vector "group".
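
To make the dummy coding concrete, here is a minimal sketch in plain 
Python. The helper names (design_row, log_lambda) and the coefficient 
values are made up for illustration. As in the formula above, the last 
group is the reference group and gets no indicator of its own.

```python
import math

def design_row(genome_length, group_index, n_groups):
    """Build [1, genome_length, d_0, ..., d_{N-2}] for one observation.

    Groups are numbered 0..N-1. The last group (N-1) is the reference,
    so all of its N-1 indicator variables are zero.
    """
    dummies = [1.0 if group_index == k else 0.0 for k in range(n_groups - 1)]
    return [1.0, float(genome_length)] + dummies

def log_lambda(b, row):
    """Linear predictor: the dot product of coefficients and predictors."""
    return sum(bi * xi for bi, xi in zip(b, row))

# Hypothetical coefficients for N = 3 groups:
# b[0] intercept, b[1] slope for genome length, b[2:4] group offsets.
b = [0.5, 0.2, -0.3, 0.4]

row = design_row(genome_length=3.0, group_index=1, n_groups=3)
expected_count = math.exp(log_lambda(b, row))  # lambda for this observation
```

With real data one would stack such rows into a design matrix and 
estimate b by maximum likelihood instead of choosing it by hand.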

One could of course consider even more complicated models, such as 
interaction terms between bacterial group and genome length. It's just a 
matter of adding in the appropriate predictor variables.

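Normally, the p-value of a Poisson regression model can be obtained from 
a likelihood ratio test against a reduced model, provided the samples 
are independent: twice the difference in log-likelihood is asymptotically 
chi-square distributed, with degrees of freedom equal to the number of 
parameters dropped in the reduced model.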
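
As a rough sketch of that test for the single-predictor model, assuming 
independent samples: the code below fits log(lambda) = b0 + b1 * x by 
Newton-Raphson and compares it with an intercept-only model. The data 
are invented for illustration and the function names are my own. In 
practice a library GLM routine would replace the hand-written fit.

```python
import math

def poisson_loglik(y, mu):
    """Sum of independent Poisson log-probabilities."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))

def fit_poisson(x, y, n_iter=50, tol=1e-10):
    """Maximum likelihood for log(lambda) = b0 + b1 * x via Newton-Raphson."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian (2 x 2, symmetric)
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Made-up example data: genome length and counts per genome.
x = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
y = [1, 2, 1, 3, 2, 4, 5, 4, 7, 8]

b0, b1 = fit_poisson(x, y)
mu_full = [math.exp(b0 + b1 * xi) for xi in x]
mu_null = [sum(y) / len(y)] * len(y)   # MLE of the intercept-only model

lr_stat = 2.0 * (poisson_loglik(y, mu_full) - poisson_loglik(y, mu_null))
# Chi-square survival function with 1 degree of freedom
p_value = math.erfc(math.sqrt(lr_stat / 2.0))
print(b0, b1, lr_stat, p_value)
```

The erfc expression is the chi-square tail probability for one degree of 
freedom only. A reduced model that drops more than one parameter needs 
the general chi-square survival function instead.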

But if samples are not independent, one cannot assume that the total 
log-likelihood for the whole data is the sum of log-likelihoods for each 
data point. So Peter would need to derive a correction for this. I 
cannot be more specific because I don't know how this between-sample 
dependency arises. Perhaps Peter could explain it?

Sturla