[SciPy-Dev] Least-Squares Linear Solver ( scipy.linalg.lstsq ) not optimal
Sturla Molden
sturla.molden at gmail.com
Mon Feb 9 04:48:50 EST 2015
On 19/01/15 00:17, Sturla Molden wrote:
> On 18/01/15 15:10, josef.pktd at gmail.com wrote:
>
>
>> I don't know how well the current lstsq is doing since I never checked.
>> Using scipy pinv (SVD based ?) for linear regression works pretty well
>> in those cases.
>
> pinv uses SVD, but there is a big difference: you are referring to
> using SVD on the scatter matrix X'X, whereas SciPy uses SVD on the
> data matrix X in the least-squares solver.
>
I misunderstood Josef: he was talking about pinv(X), not pinv(X'X).
So he was talking about
dot(pinv(X), y)
and not what I thought, namely
dot(dot(pinv(dot(X.T, X)), X.T), y)
The latter would be the naive transcription of the linear regression
textbook formula
inverse(X'X) X' y
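For a well-conditioned problem all three forms give the same coefficients; a minimal sketch (with made-up random data) comparing them:

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)

# Naive textbook transcription: inv(X'X) X' y.  Numerically the worst
# choice, since forming X'X squares the condition number of the problem.
beta_normal = linalg.inv(X.T @ X) @ X.T @ y

# Pseudoinverse of the data matrix X (SVD-based).
beta_pinv = linalg.pinv(X) @ y

# SVD-based least-squares solver; lstsq also returns residues, rank
# and the singular values, which we discard here.
beta_lstsq, *_ = linalg.lstsq(X, y)

print(np.allclose(beta_normal, beta_pinv))
print(np.allclose(beta_pinv, beta_lstsq))
```

On ill-conditioned X the normal-equations version degrades first, which is the reason to prefer the latter two.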
So what is the difference between lstsq(X,y) and dot(pinv(X),y)? It
turns out: "almost nothing".
If we look at the equations that LAPACK solves, then lstsq(X,y) with SVD
actually forms pinv(X) internally in LAPACK and then matrix-multiplies
it with y. But dot(pinv(X),y) carries a small amount of redundant
computation: np.linalg.pinv(X) calls lstsq with X and an identity
matrix, so internally LAPACK forms pinv(X) and then multiplies pinv(X)
with I to return pinv(X). Calling lstsq(X,y) instead of
dot(pinv(X),y) avoids this one redundant matrix multiplication, but
otherwise they are identical from a computational perspective. The dot
product in the latter is done internally in LAPACK in the former,
calling the same BLAS function, so that is just some constant Python
overhead. The only numerical difference then is the multiplication of
pinv(X) with an identity matrix internally in LAPACK.
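The "lstsq against an identity matrix" view of the pseudoinverse can be checked directly. A small sketch (assuming the SVD-based lstsq; whether np.linalg.pinv literally routes through lstsq depends on the NumPy version, but the mathematical identity holds regardless):

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(42)
X = rng.standard_normal((20, 4))

# Solve the least-squares problem X B = I column by column; the
# minimum-norm solution B is the Moore-Penrose pseudoinverse of X.
I = np.eye(X.shape[0])
B, *_ = linalg.lstsq(X, I)

print(np.allclose(B, linalg.pinv(X)))
```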
Sturla