[SciPy-Dev] Least-Squares Linear Solver ( scipy.linalg.lstsq ) not optimal

Sturla Molden sturla.molden at gmail.com
Mon Feb 9 04:48:50 EST 2015


On 19/01/15 00:17, Sturla Molden wrote:
> On 18/01/15 15:10, josef.pktd at gmail.com wrote:
>
>
>> I don't know how well the current lstsq is doing since I never checked.
>> Using scipy pinv (SVD based ?) for linear regression works pretty well
>> in those cases.
>
> pinv uses SVD, but there is a big difference: You are referring to 
> using SVD on the scatter matrix X'X, whereas SciPy uses SVD on the 
> data matrix X for the least-squares solver.
>
I misunderstood Josef; he was talking about pinv(X), not pinv(X'X).

So he was talking about

    dot(pinv(X),y)

and not what I thought

    dot(dot(pinv(dot(X.T,X)),X.T),y)

The latter would be the naive transcription of a linear regression 
textbook's

    inverse(X' X) X' y
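
For illustration, here is a minimal sketch (with made-up random data) 
comparing the three formulations. On a well-conditioned X they all 
agree; the caveat is that the naive normal-equations version squares 
the condition number:

    import numpy as np
    from scipy.linalg import lstsq, pinv

    # Made-up, well-conditioned data, for illustration only.
    rng = np.random.RandomState(0)
    X = rng.randn(100, 5)
    y = rng.randn(100)

    # 1) SciPy's least-squares solver.
    beta_lstsq = lstsq(X, y)[0]

    # 2) Pseudoinverse applied to y (what Josef meant).
    beta_pinv = np.dot(pinv(X), y)

    # 3) Naive textbook transcription of inverse(X'X) X' y.
    beta_naive = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), y)

    # On well-conditioned X all three agree to machine precision; on
    # an ill-conditioned X the naive version loses roughly twice as
    # many digits, since cond(X'X) = cond(X)**2.
    print(np.allclose(beta_lstsq, beta_pinv))   # True
    print(np.allclose(beta_lstsq, beta_naive))  # True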

So what is the difference between lstsq(X,y) and dot(pinv(X),y)? It 
turns out: "almost nothing".

If we look at the equations that LAPACK solves, lstsq(X,y) with the 
SVD driver actually forms pinv(X) internally in LAPACK and then 
matrix-multiplies it with y. Doing dot(pinv(X),y) instead adds a small 
amount of redundant computation: scipy.linalg.pinv(X) calls lstsq with 
X and an identity matrix, so internally LAPACK forms pinv(X) and then 
multiplies pinv(X) with I just to return pinv(X). Calling lstsq(X,y) 
instead of dot(pinv(X),y) avoids this one redundant matrix 
multiplication, but otherwise the two are computationally identical. 
The dot product done in Python in the latter is done internally in 
LAPACK in the former, calling the same BLAS function, so that 
difference is just constant Python overhead. The only numerical 
difference is the extra multiplication of pinv(X) by an identity 
matrix inside LAPACK.
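
A minimal sketch of that call path (assuming the lstsq-based 
scipy.linalg.pinv implementation described above):

    import numpy as np
    from scipy.linalg import lstsq, pinv

    rng = np.random.RandomState(42)
    X = rng.randn(50, 4)
    y = rng.randn(50)

    # pinv(X) the way scipy.linalg.pinv computes it internally (at the
    # time of writing): solve X B = I in the least-squares sense.
    B = lstsq(X, np.identity(X.shape[0]))[0]

    print(np.allclose(B, pinv(X)))                          # True
    print(np.allclose(lstsq(X, y)[0], np.dot(pinv(X), y)))  # True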

Sturla
