[SciPy-Dev] Least-Squares Linear Solver ( scipy.linalg.lstsq ) not optimal

Alexander Grigorievskiy alex.grigorievskiy at gmail.com
Mon Jan 19 03:53:47 EST 2015


Hi Sturla, Josef,

Thanks for your comments.

On 01/19/2015 12:50 AM, Sturla Molden wrote:
> The "fastest" lapack least-squares solver (apart from using Cholesky, 
> which is even faster), is *gels which uses QR or LQ factorization. If 
> the data is rank n x p in C order *gels can fit least squares with LQ, 
> which thus avoids the transpose to n x p in Fortran order.
>
> *gelss et al. can only work with data rank n x p in Fortran order.
>
> Which SVD-based solver is the faster depends on the hardware and the 
> LAPACK library.
>
> Correctness also beats speed. It is wrong to say that the fastest
> least-squares solver is the best one; by that logic we should be using
> Cholesky factorization.
>
>
> Sturla
>

In the LAPACK documentation it is explicitly written that "gels" can
solve only full-rank problems. "gelss", "gelsd", and "gelsy" can also
solve rank-deficient problems, and both "gelsy" and "gelsd" are faster
than "gelss" (the one currently used), although they may require some
more workspace:
http://www.netlib.org/lapack/lug/node27.html
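
To make the distinction concrete, here is a small sketch (my own
illustration, not from the LAPACK docs) showing that the SVD-based
driver behind scipy.linalg.lstsq handles a rank-deficient system and
reports the effective rank, which is exactly what a QR-based "gels"
call cannot do (the matrix sizes are arbitrary):

import numpy as np
import scipy.linalg

rng = np.random.RandomState(0)
# Build a 6 x 4 matrix of rank 2: "gels" assumes full column rank,
# the SVD-based drivers ("gelss"/"gelsd") do not.
A = rng.randn(6, 2).dot(rng.randn(2, 4))
b = rng.randn(6)

x, res, rank, sv = scipy.linalg.lstsq(A, b)  # SVD-based driver
print(rank)   # reports effective rank 2, not 4
print(sv)     # two of the four singular values are numerically zero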

My experiments show the same. By the way, were you able to see the
attachment in my first letter?

In the experiments I solved the system Ax = b, where A is (m x n). I
varied m, took n = 2/3*m, and set the rank r = 1/2*m.
Then I found x, measured the time, and compared the solutions (by max
norm) between the different methods to monitor the accuracy.
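
Roughly, the experiment looks like this (a simplified sketch, not the
exact script from my attachment; m = 3000 is just an example size, and
the driver names in the comments are the ones each library uses):

import time
import numpy as np
import scipy.linalg

m = 3000
n = 2 * m // 3        # n = 2/3 * m
r = m // 2            # rank = 1/2 * m

rng = np.random.RandomState(0)
A = rng.randn(m, r).dot(rng.randn(r, n))   # (m x n) matrix of rank r
b = rng.randn(m)

t0 = time.time()
x_sp = scipy.linalg.lstsq(A, b)[0]   # "gelss"-based in current SciPy
print("scipy.linalg.lstsq: %.3f s" % (time.time() - t0))

t0 = time.time()
x_np = np.linalg.lstsq(A, b)[0]      # "gelsd"-based in NumPy
print("np.linalg.lstsq:    %.3f s" % (time.time() - t0))

# max-norm difference between the two minimum-norm solutions
print("max difference: %g" % np.abs(x_sp - x_np).max())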

I agree that the speed may depend on the hardware and the LAPACK
library. But I use a standard modern computer and the standard LAPACK,
which is probably the most frequent use case for SciPy. I guess the
goal of SciPy is not to optimize for ScaLAPACK, MAGMA and so forth.

I have also monitored the accuracy; as I wrote, it is practically the
same for all methods, < 2*10^-16.
I attach a graph of the accuracies to this letter (sorry, the legend is
duplicated), which shows the max-norm difference between the various
methods.

> brain-dead ?   (numerical mafia?)
>
> Numerically significant precision problems don't mean that they are
> "statistically" important.
> :)
>
> If you have to worry about numerical precision in statistical
> analysis, then (most of the time) you are screwed already much earlier
> and you better rethink your choice of models or statistical method.
>
> inv(x'x) is perfectly fine, the warning in the old statistical
> literature starts at a condition number of 30 !
> (I think I picked a condition number of a few 1000 to print the
> warning in statsmodels.)
>
>
> However, statsmodels uses pinv(x) which uses svd(x), and allows by
> default for singular x. The alternative method uses QR which fails on
> singular x.
>
> np.linalg.pinv is doing reasonably ok on the toughest NIST problem,
> but it's a piece of cake if we standardize the x values beforehand.
> scipy.linalg.pinv was doing a little bit better at default settings.
>
>
> Josef
>

Josef, I think you can solve least squares via the SVD and the
pseudo-inverse, but this is not a direct solution.
That is, first you compute the SVD, then the pseudo-inverse, and only
then solve the least-squares problem, rather than solving it directly
(although perhaps by similar methods internally). I am not saying that
this is wrong, but then why are there separate functions to solve least
squares in LAPACK, SciPy, and NumPy?
My desire is to improve the direct least-squares solver lstsq by
calling the appropriate LAPACK function.
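
To illustrate the two routes I mean (the matrix sizes here are
arbitrary, chosen only for the example):

import numpy as np
import scipy.linalg

rng = np.random.RandomState(0)
A = rng.randn(100, 60)
b = rng.randn(100)

# Indirect route: form the SVD-based pseudo-inverse explicitly, then
# multiply. This materializes pinv(A) as an intermediate result.
x_indirect = np.linalg.pinv(A).dot(b)

# Direct route: a single call to a LAPACK least-squares driver.
x_direct = scipy.linalg.lstsq(A, b)[0]

print(np.abs(x_indirect - x_direct).max())  # agree to ~machine precision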

Best regards,
    Alexander Grigorievskiy, PhD student, Aalto University

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ls_test_1_accuracies.png
Type: image/png
Size: 69600 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20150119/09ace33b/attachment.png>
