[SciPy-User] Generalized least square on large dataset

Nathaniel Smith njs at pobox.com
Fri Mar 9 14:46:07 EST 2012


On Fri, Mar 9, 2012 at 7:30 PM, Peter Cimermančič
<peter.cimermancic at gmail.com> wrote:
> Sure, please see attached. Bacteria.jpg is the plot we're talking about. As
> you can see there is a nice correlation in the graph, but I'm afraid there
> might be something like in the second figure (ives.jpg) going on. The second
> figure is from Ives and Zhu; Statistics for correlated data: phylogenies,
> space and time (2006).

So in the figure from Ives and Zhu, the two variables do seem to be
well-correlated across groups, but then within individual groups they
aren't well-correlated. Is that what you're worried about -- that gene
count and genome length might be correlated overall, but not within
individual groups?

If so, GLS doesn't actually address that question. It lets you
correct your p-values for the fact that similar bacteria are not
independent observations, which means you effectively have somewhat
less data than it would otherwise appear, and thus your p-values
should be larger than they would be in a naive analysis. But it'd
still be a p-value on whether the two variables are correlated
overall. (Which they obviously are...)
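
To make that concrete, here is a minimal sketch with NumPy. Everything
in it is an assumption for illustration: the data are synthetic (not
your bacteria), the grouping is made up, and the error covariance V is
simply taken as known, whereas in a real analysis it would have to come
from something like the phylogeny. It fits the same overall slope by
naive OLS and by GLS and compares the standard errors.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the bacteria data from this thread):
# 20 hypothetical groups of 10 related observations each.
n_groups, per_group = 20, 10
group = np.repeat(np.arange(n_groups), per_group)
n = group.size

# x varies mostly between groups; y depends on x plus a shared group effect.
x = rng.normal(size=n_groups)[group] + 0.1 * rng.normal(size=n)
y = 1.0 * x + rng.normal(size=n_groups)[group] + 0.5 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])  # intercept + slope

# Assumed-known error covariance: variance 1.0 shared within a group,
# plus independent noise with variance 0.25 on the diagonal.
same_group = (group[:, None] == group[None, :]).astype(float)
V = 1.0 * same_group + 0.25 * np.eye(n)


def two_sided_p(z):
    # Normal approximation to the two-sided p-value for a z statistic.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Naive OLS: treats all n observations as independent.
XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y
resid = y - X @ beta_ols
s2 = resid @ resid / (n - X.shape[1])
se_ols = np.sqrt(s2 * XtX_inv[1, 1])

# GLS: beta = (X' V^-1 X)^-1 X' V^-1 y, with covariance (X' V^-1 X)^-1.
Vinv_X = np.linalg.solve(V, X)
cov_gls = np.linalg.inv(X.T @ Vinv_X)
beta_gls = cov_gls @ (Vinv_X.T @ y)
se_gls = np.sqrt(cov_gls[1, 1])

for name, b, se in [("OLS", beta_ols[1], se_ols), ("GLS", beta_gls[1], se_gls)]:
    print(f"{name}: slope={b:.3f}  se={se:.3f}  p~{two_sided_p(b / se):.2g}")
```

In this setup the GLS standard error on the slope comes out larger than
the naive OLS one, because the 200 points carry less independent
information than their count suggests. Both fits are still estimating
one overall slope across all the data; neither says anything about
whether the relationship holds within each group.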

-- Nathaniel


