[SciPy-Dev] Least-Squares Linear Solver ( scipy.linalg.lstsq ) not optimal

josef.pktd at gmail.com josef.pktd at gmail.com
Mon Jan 19 08:51:12 EST 2015


On Mon, Jan 19, 2015 at 3:53 AM, Alexander Grigorievskiy <
alex.grigorievskiy at gmail.com> wrote:

> Hi Sturla, Josef
>
> Thanks for your comments.
>
> On 01/19/2015 12:50 AM, Sturla Molden wrote:
> > The "fastest" lapack least-squares solver (apart from using Cholesky,
> > which is even faster), is *gels which uses QR or LQ factorization. If
> > the data is rank n x p in C order *gels can fit least squares with LQ,
> > which thus avoids the transpose to n x p in Fortran order.
> >
> > *gelss et al. can only work with data that is n x p in Fortran order.
> >
> > Which SVD-based solver is faster depends on the hardware and the
> > LAPACK library.
> >
> > Correctness also beats speed. It is wrong to say that the fastest
> > least-squares solver is the best one; otherwise we would be using
> > Cholesky factorization.
> >
> >
> > Sturla
> >
>
> In the LAPACK documentation it is explicitly written that "gels" can
> solve only full-rank problems. "gelss", "gelsd", and "gelsy" can also
> solve rank-deficient problems, and both "gelsy" and "gelsd" are faster
> than "gelss" (the one currently used), although they may require somewhat
> more workspace.
> http://www.netlib.org/lapack/lug/node27.html
>
> My experiments show the same. By the way, were you able to see the
> attachment in my first message?
>
> In the experiments I solved the system Ax=b, where A is m x n. I varied
> m, taking n = 2/3*m and rank r = 1/2*m.
> Then I found x, measured the time, and compared the solutions (by max
> norm) between the different methods to monitor the accuracy.
>
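As a concrete reference for the quoted experiment, here is a minimal
benchmark sketch. It assumes a NumPy/SciPy recent enough to have
numpy.random.default_rng and the lapack_driver keyword of
scipy.linalg.lstsq (added in SciPy 0.17, i.e. after this thread); the
sizes and the rank construction follow the quoted description:

    import time
    import numpy as np
    from scipy.linalg import lstsq

    rng = np.random.default_rng(0)
    for m in (300, 600, 1200):
        n, r = 2 * m // 3, m // 2
        # Build an m x n matrix of rank r from two random factors.
        A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
        b = rng.standard_normal(m)
        for driver in ("gelss", "gelsd", "gelsy"):
            t0 = time.perf_counter()
            x, res, rank, sv = lstsq(A, b, lapack_driver=driver)
            print("m=%5d %-6s %.4fs  detected rank %d"
                  % (m, driver, time.perf_counter() - t0, rank))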

I didn't read carefully enough and missed the rank part in the bottom of
the graph.

The speed improvements do look impressive, and the LAPACK documentation
page seems to indicate that these are the more modern routines for solving
this.

Our most common case would be fixed number of columns and increasing number
of rows.


>
> I agree that the speed may depend on the hardware and the LAPACK
> library. But I use a standard modern computer and the standard LAPACK,
> which is probably the most frequent use case for SciPy. I guess the goal
> of SciPy is not to optimize for ScaLAPACK, MAGMA, and so forth.
>
> I have also monitored the accuracy; as I wrote, it is practically the
> same for all methods, below 2*10^-16.
> I attach the graph of the accuracies to this message (sorry, the legend
> is doubled), which shows the max-norm difference
> between the various methods.
>
> > brain-dead ?   (numerical mafia?)
> >
> > Numerically significant precision problems don't mean that they are
> > "statistically" important.
> > :)
> >
> > If you have to worry about numerical precision in statistical
> > analysis, then (most of the time) you were screwed much earlier
> > already, and you had better rethink your choice of models or
> > statistical method.
> >
> > inv(x'x) is perfectly fine; the warnings in the old statistical
> > literature start at a condition number of 30!
> > (I think I picked a condition number of a few thousand for printing
> > the warning in statsmodels.)
> >
> >
> > However, statsmodels uses pinv(x), which uses svd(x) and by default
> > allows for singular x. The alternative method uses QR, which fails on
> > singular x.
> >
> > np.linalg.pinv is doing reasonably ok on the toughest NIST problem,
> > but it's a piece of cake if we standardize the x values beforehand.
> > scipy.linalg.pinv was doing a little bit better at default settings.
> >
> >
> > Josef
> >
>
> Josef, I think you can solve least squares via SVD and the
> pseudo-inverse, but this is not a direct solution.
> That is, first you compute the SVD, then the pseudo-inverse, and only
> then do you solve the least-squares problem, rather than solving it
> directly (although perhaps by similar methods). I am not saying that
> this is wrong, but then why are there separate functions
> for solving least squares in LAPACK, SciPy, and NumPy?
> My desire is to improve the direct least-squares solver lstsq by calling
> the appropriate LAPACK function.
>
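To make the quoted contrast concrete, a small sketch (illustrative only):
both routes return the minimum-norm solution, but the pinv route
materializes an n x m pseudo-inverse before multiplying, while lstsq hands
the whole problem to a single LAPACK driver:

    import numpy as np
    from scipy.linalg import lstsq, pinv

    rng = np.random.default_rng(1)
    A = rng.standard_normal((60, 40))
    b = rng.standard_normal(60)

    x_pinv = pinv(A) @ b        # SVD -> pseudo-inverse -> matrix-vector product
    x_direct = lstsq(A, b)[0]   # one call into the LAPACK least-squares driver
    assert np.allclose(x_pinv, x_direct)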

This was mainly an aside in my initial message, to illustrate that I
checked the behavior of an SVD version in the singular and near-singular
case, and it is relatively easy to understand what regularization is used
in those cases. Without checking, I wouldn't know what the "complete
orthogonal factorization" in gelsy is doing.

Another aside, not really relevant for the issue: in statsmodels,
in the main use case, OLS, we also need additional results like inv(x'x)
(or a numerically stable version of it) and the rank, and even pinv is
already a bit high-level for this; using the raw SVD might save us some
calculations (a sketch follows below).
In other cases we want to solve for different right-hand sides which we
might not know in advance; there pinv is my, or our, preferred solution.
The least common case is the straightforward linalg.lstsq, where we only
need the solution to one set of linear equations.
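A sketch of that aside (illustrative, not statsmodels' actual code): one
SVD of the design matrix gives the coefficient estimates, the numerical
rank, and a stable inv(X'X) in one pass. The full-rank case is assumed
here; for rank-deficient X one would first truncate the small singular
values as in the pinv sketch above:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((100, 5))    # design matrix (full rank here)
    y = rng.standard_normal(100)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    tol = s.max() * max(X.shape) * np.finfo(s.dtype).eps
    rank = int((s > tol).sum())
    beta = Vt.T @ ((U.T @ y) / s)        # pinv(X) @ y without forming pinv
    xtx_inv = (Vt.T / s**2) @ Vt         # V diag(1/s^2) V' = inv(X'X)

    assert rank == 5
    assert np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0])
    assert np.allclose(xtx_inv, np.linalg.inv(X.T @ X))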

We also use np.linalg.solve (which according to the docstring uses dgesv)
for simple inv(a) dot b problems in matrix equations.
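For completeness, a tiny example of that pattern; solve factorizes a (via
*gesv's LU decomposition) instead of forming its inverse explicitly:

    import numpy as np

    a = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([9.0, 8.0])
    x = np.linalg.solve(a, b)    # LU solve; preferred over np.linalg.inv(a) @ b
    assert np.allclose(a @ x, b)
    assert np.allclose(x, [2.0, 3.0])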

Overall, I think more benchmarks and evaluations of the different LAPACK
solutions, like you did, are useful. (Especially for users like me who
have only a rough idea about the linalg jungle, i.e. the myriad
possibilities to do the "same" thing.)

I suggest you submit a PR to scipy and the linalg developers/maintainers
will look at the details.

Josef


>
> Best regards,
>     Alexander Grigorevskiy, PhD student, Aalto University
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
>

