[SciPy-User] Generalized least square on large dataset

Sat Mar 10 10:05:58 EST 2012

On 10.03.2012 14:57, josef.pktd at gmail.com wrote:
>
> He explained the between-sample correlation with the similarity (my
> analogy: autocorrelation in time series, or spatial correlation).
>
>

Look at his attachment ives.tiff.

If the categories are known in advance (right panel in
ives.tiff), I think what he actually needs is to compute
the likelihood ratio between the full model

     log(lambda) = b[0] + b[1] * genome_length
           + np.dot(b[2:N+1], group[0:N-1])

and a reduced model

     log(lambda) = b[0] + np.dot(b[1:N], group[0:N-1])

That is, the null hypothesis is that adding genome length as
a predictor does not improve the fit, given that the bacterial
groups are already in the model.
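
A minimal sketch of such a test, assuming a Poisson response and
using synthetic stand-in data (the counts, the number of groups and
the genome lengths below are made up for illustration, not taken from
the attachment). The helper fit_poisson is a hypothetical name; it
fits the log-linear model with plain Newton iterations so that only
NumPy and SciPy are needed:

```python
import numpy as np
from scipy import stats

def fit_poisson(X, y, n_iter=100, tol=1e-10):
    """Fit log(lambda) = X @ b by Newton's method.

    Assumes column 0 of X is the intercept. Returns (b, loglik), where
    loglik omits the log(y!) term, which cancels in a likelihood ratio.
    """
    b = np.zeros(X.shape[1])
    b[0] = np.log(y.mean())            # intercept-only starting point
    for _ in range(n_iter):
        lam = np.exp(X @ b)
        step = np.linalg.solve(X.T @ (lam[:, None] * X), X.T @ (y - lam))
        b += step
        if np.max(np.abs(step)) < tol:
            break
    lam = np.exp(X @ b)
    return b, np.sum(y * np.log(lam) - lam)

# Synthetic stand-in data (hypothetical, not the poster's data)
rng = np.random.default_rng(0)
N, n = 4, 200                                   # N groups, n bacteria
labels = rng.integers(0, N, size=n)
group = np.eye(N)[labels][:, :N - 1]            # N-1 dummy columns
genome_length = rng.uniform(1.0, 10.0, size=n)  # arbitrary units
genome_length -= genome_length.mean()           # centre for numerical stability
y = rng.poisson(np.exp(1.0 + 0.5 * labels))     # counts depend on group only

ones = np.ones((n, 1))
X_full = np.hstack([ones, genome_length[:, None], group])  # b[0], b[1], b[2:N+1]
X_red = np.hstack([ones, group])                           # b[0], b[1:N]

b_full, ll_full = fit_poisson(X_full, y)
b_red, ll_red = fit_poisson(X_red, y)

lr = 2.0 * (ll_full - ll_red)      # likelihood ratio statistic
p_value = stats.chi2.sf(lr, df=1)  # the full model has one extra parameter
print(f"LR = {lr:.3f}, p = {p_value:.3f}")
```

Under the null hypothesis the statistic is asymptotically chi-square
with one degree of freedom, since the two models differ only in b[1].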

If he does not have groups, but some sort of dendrogram
(left panel in ives.tiff), perhaps he could preprocess the
data by clustering the bacteria based on his dendrogram?

A full dendrogram (e.g. used as a nested log-linear model)
would overfit the data and explain it perfectly, so adding
genome length would always give zero improvement. But if
the dendrogram can be reduced to a few discrete categories,
he could use a likelihood ratio test for the genome length.
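
One possible way to do that reduction, sketched under the assumption
that pairwise distances between the bacteria are available (or that
the dendrogram can be expressed as a SciPy linkage matrix). The data
and the choice of k = 3 categories are hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Hypothetical stand-in: 30 bacteria in three well-separated clusters
traits = np.concatenate(
    [rng.normal(c, 0.5, size=10) for c in (0.0, 10.0, 20.0)]
)
dist = pdist(traits[:, None])          # condensed pairwise distances

Z = linkage(dist, method="average")    # dendrogram as a linkage matrix
k = 3                                  # number of categories to keep (a choice)
labels = fcluster(Z, t=k, criterion="maxclust")  # flat labels in 1..k

# Dummy-code the categories, dropping the last one as the reference level
_, idx = np.unique(labels, return_inverse=True)
n_groups = idx.max() + 1
group = np.eye(n_groups)[idx][:, :n_groups - 1]
print(labels)
```

The resulting group matrix can then play the role of group[0:N-1] in
the two models above. The number of categories is a modelling choice:
too many and the groups absorb everything, too few and they may not
capture the relatedness.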

Sturla