[SciPy-User] scipy.linalg.lstsq

Tim Valenta tonightslastsong at gmail.com
Tue Sep 14 14:10:19 EDT 2010


Hello all,

Longtime Python user here.  I'm a computer science guy who went mostly
through calculus and discrete math courses, with never much in pure
statistics, which leads me to my dilemma:

I'm doing least-squares linear regression on a scatter plot of points, where
several inputs lead to one output.  That is, it's not just one set of (x, y)
coordinates, but several parallel sets, more like (x1, y), (x2, y), (x3, y),
up to (xn, y).  I'm attempting to get my company off of a severely broken
calculation engine, but throwing my data at linalg.lstsq or stats.linregress
isn't a magic one-step route to my desired solution.

Basically, the company ends up using data from Excel (Microsoft Office 2003),
which can do a barely configurable regression calculation on a matrix of
data, each row formatted like y, x1, x2, x3 (to follow my example above).  It
spits out all kinds of variables and data, and the company draws on the
calculated 'coefficients' to make a pie graph, with the coefficients taken as
percentages.  The Excel coefficients come out somewhere in the range (-1, 1).
Their goal is to take the 2 or 3 largest coefficients and look at the
corresponding xn (that is, x1, x2, or x3) to decide which of the x values is
most influential on the resulting y.

So first:

linalg.lstsq gives me back four values, none of which I completely
understand except `rank`.  The first returned value, `x`, claims to be a
vector, but I'm a little lost on that.  If it's a vector, I assume it's
projecting from the origin?  That seems too easy, and likely incorrect, but
I've got no `intercept` value to look at.  Is this capable of giving me the
result I'm chasing?
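
To make the question concrete, here is a minimal sketch of the lstsq call
with made-up data.  The names X, y, and A and all the numbers are
hypothetical, not my real data.  It assumes the usual reading of the return
value: `x` holds one coefficient per column of the matrix passed in, and
lstsq adds no intercept on its own, so a column of ones is prepended to get
one.

```python
import numpy as np
from scipy import linalg

# Hypothetical data: 6 samples, 3 inputs (x1, x2, x3), one output y.
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 2.0],
    [4.0, 3.0, 3.5],
    [5.0, 6.0, 4.0],
    [6.0, 5.0, 5.5],
])

# y is built from known values so the fit can be checked by eye:
# intercept 1.0 and coefficients 2.0, -0.5, 0.25 for x1, x2, x3.
y = 1.0 + X @ np.array([2.0, -0.5, 0.25])

# lstsq fits y ~ A @ coef with no constant term of its own, so a column
# of ones is prepended to act as the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Return values: solution vector, residues, rank of A, singular values of A.
coef, residues, rank, sv = linalg.lstsq(A, y)

intercept = coef[0]   # coefficient of the ones column
slopes = coef[1:]     # one coefficient each for x1, x2, x3

print("intercept:", intercept)
print("slopes:", slopes)
print("rank:", rank)
```

If that reading is right, `x` is not a geometric vector from the origin but
simply the list of fitted coefficients, in the same order as the columns of
A.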

Second:

stats.linregress gives me much more familiar data, but seems to compute only
one (x, y) plot at a time, leading me to conclude that I should do 3 separate
computations: linregress with all (x1, y) data points, then again with the
(x2, y) points, and then again with the (x3, y) points.  Using this method, I
get slopes, intercepts, r-values, etc.  Is it then up to me to minimize the
r^2 value?  My lack of exposure to stats leaves me unsure of this step.
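
A minimal sketch of the three separate computations described above, again
with hypothetical data and names (X, y, fits).  One thing I am assuming
here: r^2 for a single input measures how much of the variation in y that
input explains by itself, so a larger value means a better one-input fit.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 6 samples, 3 inputs (x1, x2, x3), one output y.
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 2.0],
    [4.0, 3.0, 3.5],
    [5.0, 6.0, 4.0],
    [6.0, 5.0, 5.5],
])
y = 1.0 + X @ np.array([2.0, -0.5, 0.25])

# One independent simple regression per input column.
fits = []
for j in range(X.shape[1]):
    slope, intercept, r_value, p_value, std_err = stats.linregress(X[:, j], y)
    fits.append((slope, intercept, r_value ** 2))

for j, (slope, intercept, r2) in enumerate(fits, start=1):
    print("x%d: slope=%.4f intercept=%.4f r^2=%.4f" % (j, slope, intercept, r2))
```

Each of these fits ignores the other two inputs, so the slopes will in
general not match the coefficients from one joint fit of all three.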

Third:

Is the procedure my company uses totally flawed?  That is, if I want to
determine which of x1, x2, and x3 is most influential on y, across all
samples, is there a more direct calculation that yields the correct
conclusion?
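
One candidate I have come across, sketched here under assumptions: rescale
every input and the output to zero mean and unit standard deviation before
fitting, so the resulting coefficients (often called standardized
coefficients) are on a comparable scale.  The data and the names X, y, Xz,
yz, beta, and order are hypothetical, and I don't know whether this is the
right measure of "influence" when the inputs are correlated with each other.

```python
import numpy as np
from scipy import linalg

# Hypothetical data: 6 samples, 3 inputs (x1, x2, x3), one output y.
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 2.0],
    [4.0, 3.0, 3.5],
    [5.0, 6.0, 4.0],
    [6.0, 5.0, 5.5],
])
y = 1.0 + X @ np.array([2.0, -0.5, 0.25])

# Z-score each input column and the output (sample standard deviation).
Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
yz = (y - y.mean()) / y.std(ddof=1)

# Everything is centered, so no intercept column is needed here.
beta, residues, rank, sv = linalg.lstsq(Xz, yz)

# Rank the inputs by absolute standardized coefficient, largest first.
order = np.argsort(-np.abs(beta))
for j in order:
    print("x%d: standardized coefficient %.4f" % (j + 1, beta[j]))
```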

Thanks in advance-- I'm eager to wrap my head around this!  I just need some
direction as I continue to play with scipy.

Tim