[SciPy-Dev] Least-Squares Linear Solver ( scipy.linalg.lstsq ) not optimal

josef.pktd at gmail.com josef.pktd at gmail.com
Mon Feb 9 08:29:29 EST 2015


On Mon, Feb 9, 2015 at 4:48 AM, Sturla Molden <sturla.molden at gmail.com> wrote:
> On 19/01/15 00:17, Sturla Molden wrote:
>> On 18/01/15 15:10, josef.pktd at gmail.com wrote:
>>
>>
>>> I don't know how well the current lstsq is doing since I never checked.
>>> Using scipy pinv (SVD based ?) for linear regression works pretty well
>>> in those cases.
>>
>> pinv uses SVD, but there is a big difference: You are refering to
>> using SVD on the scatter matrix X'X, whereas SciPy uses SVD on the
>> data matrix X for the least-squares solver.
>>
> I misunderstood Josef, he was talking about pinv(X) not pinv(X'X).
>
> So he was talking about
>
>     dot(pinv(X),y)
>
> and not what I thought
>
>     dot(dot(pinv(dot(X.T,X)),X.T),y)
>
> The latter would be the naive transcription from a linear regression
> textbook's
>
>     inverse(X' X) X' y
>
> So what is the difference between lstsq(X,y) and dot(pinv(X),y)? It
> turns out: "almost nothing".
>
> If we look at the equations that LAPACK solves, then lstsq(X,y) with SVD
> actually forms pinv(X) internally in LAPACK and then matrix multiplies
> with y. But if we do dot(pinv(X),y) then this has a small amount of
> redundant computation in it: np.linalg.pinv(X) calls lstsq with X and an
> indentity matrix. So internally LAPACK forms pinv(x) and then multiplies
> pinv(X) with I to return pinv(X). Calling lstsq(X,y) instead of
> dot(pinv(X),y) avoids this one redundant matrix multiplication, but
> otherwise they are identical from a computational perspective. The dot
> product in the latter would be done internally in LAPACK in the former,
> calling the same BLAS function, so that is just some constant Python
> overhead. The only numerical difference then is the multiplication of
> pinv(X) with an identity matrix internally in LAPACK.

in statsmodels we are using mostly np.linalg.pinv which is just an svd
and a dot product
(no identity matrix)

https://github.com/numpy/numpy/blob/master/numpy/linalg/linalg.py#L1582

or, more precisely, IIRC, for OLS we use our copy of it that also
returns the singular values
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/tools/tools.py#L381

np.linalg.pinv was considerably faster than scipy.linalg.pinv when I
checked it a few years ago
(outdated information because code in scipy has changed).

(However, there is still a little bit of slack in our usage of pinv/SVD.)

Josef

>
> Sturla
>
>
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev



More information about the SciPy-Dev mailing list