[SciPy-User] scipy.linalg.lstsq

Sat Sep 25 15:02:44 EDT 2010

On Tue, Sep 14, 2010 at 2:10 PM, Tim Valenta <tonightslastsong at gmail.com> wrote:
> Hello all,
> Longtime Python user here.  I'm a computer science guy, went through mostly
> calculus and discrete math courses-- never much in pure statistics.  Which
> leads me into my dilemma:
> I'm doing least-square linear regression on a scatter plot of points, where
> several inputs lead to one output.  That is, it's not just one system of x,y
> coordinates, but instead several parallel systems something more like (x1,y)
> (x2, y) (x3, y), upwards to (xn, y).  I'm attempting to get my company off
> of a severely broken calculation engine, but throwing my data at
> linalg.lstsq or stats.linregress isn't just a magic one-step fix to my
> desired solution.

It took 11 days for this message to show up on the list for me. Here are
some comments in case they are still relevant.

> Basically, the company ends up using data from Microsoft Office 2003 Excel,
> which can do hardly-configurable regression calculation on a matrix of data,
> each row formatted like y, x1, x2, x3 (to follow my example given above).

From your description above, it's not really clear whether you have
different observations or different variables.

The Excel matrix looks more like you have rows of observations, with
several explanatory variables in the columns.

>  It spits out all kinds of variables and data, and the company draws on the
> calculated 'coefficients' to make a pie graph, the coefficients taken as
> percentages.  The Excel coefficient numbers come out as something in the
> range of (-1,1).  Their goal is to take the 2 or 3 largest of the
> coefficients and look at the corresponding xn value (that is, x1, x2, or x3)
> to decide which of the x values is most influential in the resulting y.

If the Excel coefficients are in the range (-1,1), that would be "unusual"
for regression coefficients. Do x1, x2, x3, ... all use the same
scale? Otherwise it wouldn't make sense to compare the regression
coefficients directly.
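
To illustrate the scale problem, here is a minimal sketch with made-up data
(the three regressors, their scales, and the helper name standardized_coefs
are all assumptions for illustration, not your data). It fits the raw
regression and then a regression on standardized (z-scored) variables, whose
coefficients are unit-free and can be compared with each other. I use
numpy.linalg.lstsq here; scipy.linalg.lstsq returns the same four values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical regressors measured on very different scales
X = np.column_stack([
    rng.normal(0, 1, n),      # x1: standard deviation about 1
    rng.normal(0, 100, n),    # x2: standard deviation about 100
    rng.normal(0, 0.01, n),   # x3: standard deviation about 0.01
])
y = 2.0 * X[:, 0] + 0.05 * X[:, 1] + 50.0 * X[:, 2] + rng.normal(0, 1, n)


def standardized_coefs(X, y):
    """Regress z-scored y on z-scored X (no intercept needed after centering)."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return beta


# Raw fit: prepend a column of ones so the first coefficient is the intercept
raw, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)
beta = standardized_coefs(X, y)

print("raw slopes:         ", raw[1:])
print("standardized slopes:", beta)
```

In this made-up example the raw slope of x3 is the largest only because x3 is
measured in tiny units; after standardizing, x2 has the largest coefficient.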

> So first:
> linalg.lstsq gives me back four variables, none of which, but `rank`, do I
> completely understand.  The first returned value, `x`, claims to be a
> vector, but I'm a little lost on that.  If it's a vector, I assume it's
> projecting from the origin point?  But that seems too easy, and likely
> incorrect, but I've got no `intercept` value to observe.  Is this capable of
> giving me the result I'm trying to chase?
> Second:
> stats.linregress gives me much more familiar data, but seems to only compute
> one (x,y) plot at a time, leading me to conclude that I should do 3 separate
> computations: linregress with all (x1,y) data points, then again with (x2,y)
> points, and then again with (x3,y) points.  Using this method, I get slopes,
> intercepts, r-values, etc.  Is it then up to me to minimize the r^2 value?
>  Simply my lack of exposure to stats leaves me unsure of this step.

With linalg.lstsq you would have to do some work to get all the desired
results. linregress can only handle one regressor.
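
As a rough sketch of that work, assuming the data is laid out like the Excel
sheet (each row is y, x1, x2, x3; the numbers below are generated and purely
illustrative): lstsq has no separate intercept output, so you add a column of
ones to the design matrix and the intercept comes back as one of the fitted
coefficients. The returned `x` is simply the vector of those coefficients. I
use numpy.linalg.lstsq; scipy.linalg.lstsq returns the same four values.

```python
import numpy as np

# Hypothetical noise-free data: y = 1 + 2*x1 - 1*x2 + 0.5*x3
rng = np.random.default_rng(1)
X_true = rng.normal(size=(20, 3))
y_true = 1.0 + X_true @ np.array([2.0, -1.0, 0.5])
data = np.column_stack([y_true, X_true])   # rows formatted as y, x1, x2, x3

y = data[:, 0]          # first column is the response
X = data[:, 1:]         # remaining columns are the regressors

# Column of ones so that the first coefficient is the intercept
A = np.column_stack([np.ones(len(y)), X])

# coef:      fitted coefficients (intercept first, then one slope per x)
# residuals: sum of squared residuals (empty if A is rank deficient)
# rank:      effective rank of A
# sing_vals: singular values of A
coef, residuals, rank, sing_vals = np.linalg.lstsq(A, y, rcond=None)

intercept, slopes = coef[0], coef[1:]
print("intercept:", intercept)
print("slopes:   ", slopes)
```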

I would recommend using either the OLS class from the SciPy Cookbook,
or, better, statsmodels.
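
For a feel of what such an OLS class reports beyond the coefficients, here is
a hand-rolled sketch using only numpy. It is not the cookbook or statsmodels
code; the function name ols_summary and the data are assumptions for
illustration. It computes the coefficients, their standard errors, t-values,
and R^2 with the textbook formulas.

```python
import numpy as np


def ols_summary(X, y):
    """Textbook OLS with an intercept: coefficients, std errors, t-values, R^2."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept
    k = A.shape[1]                            # number of fitted parameters
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ssr = resid @ resid                       # residual sum of squares
    sst = ((y - y.mean()) ** 2).sum()         # total sum of squares
    r2 = 1.0 - ssr / sst
    sigma2 = ssr / (n - k)                    # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)     # covariance of the coefficients
    se = np.sqrt(np.diag(cov))
    t = coef / se
    return coef, se, t, r2


# Hypothetical data: x2 has no effect on y at all
rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([2.0, 0.0, -3.0]) + rng.normal(0, 0.5, n)

coef, se, t, r2 = ols_summary(X, y)
print("coef:", coef)
print("se:  ", se)
print("t:   ", t)
print("R^2: ", r2)
```

The t-values (coefficient divided by its standard error) are usually a better
guide to which regressors matter than the raw coefficient sizes.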

If you really have a lot of explanatory variables, then scikits.learn
has the best algorithms to select a reduced number of them.

> Third:
> Is the procedure my company uses totally flawed?  That is, if I want to
> determine which of x1, x2, and x3 is most influential on y, across all
> samples, is there a more direct calculation that yields the correct
> conclusion?

I don't understand well enough what the selection is actually supposed
to do to comment on this.

Josef

> Thanks in advance-- I'm eager to wrap my head around this!  I just need some
> direction as I continue to play with scipy.
> Tim