[SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected?

josef.pktd at gmail.com
Fri Mar 18 16:09:30 EDT 2016


On Fri, Mar 18, 2016 at 3:56 PM, Matthew Brett <matthew.brett at gmail.com>
wrote:

> On Fri, Mar 18, 2016 at 12:19 PM,  <josef.pktd at gmail.com> wrote:
> >
> >
> > On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett <matthew.brett at gmail.com>
> > wrote:
> >>
> >> On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris
> >> <charlesr.harris at gmail.com> wrote:
> >> >
> >> >
> >> > On Wed, Mar 16, 2016 at 9:39 PM, <josef.pktd at gmail.com> wrote:
> >> >>
> >> >>
> >> >>
> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris
> >> >> <charlesr.harris at gmail.com> wrote:
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett
> >> >>> <matthew.brett at gmail.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> Hi,
> >> >>>>
> >> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris
> >> >>>> <charlesr.harris at gmail.com> wrote:
> >> >>>> >
> >> >>>> >
> >> >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris
> >> >>>> > <charlesr.harris at gmail.com> wrote:
> >> >>>> >>
> >> >>>> >>
> >> >>>> >>
> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett
> >> >>>> >> <matthew.brett at gmail.com>
> >> >>>> >> wrote:
> >> >>>> >>>
> >> >>>> >>> Hi,
> >> >>>> >>>
> >> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers
> >> >>>> >>> <ralf.gommers at gmail.com>
> >> >>>> >>> wrote:
> >> >>>> >>> >
> >> >>>> >>> >
> >> >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett
> >> >>>> >>> > <matthew.brett at gmail.com>
> >> >>>> >>> > wrote:
> >> >>>> >>> >>
> >> >>>> >>> >> Hi,
> >> >>>> >>> >>
> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails
> >> >>>> >>> >> for scipy 0.17.0:
> >> >>>> >>> >>
> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563
> >> >>>> >>> >>
> >> >>>> >>> >> This turns out to be due to small changes in the result of
> >> >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0, on the same
> >> >>>> >>> >> platform (the travis Ubuntu in this case).
> >> >>>> >>> >>
> >> >>>> >>> >> Is that expected?  I haven't investigated deeply, but I
> >> >>>> >>> >> believe the 0.16.1 result (back into the depths of time) is
> >> >>>> >>> >> slightly more accurate.  Is that possible?
> >> >>>> >>> >
> >> >>>> >>> >
> >> >>>> >>> > That's entirely possible, because we changed the underlying
> >> >>>> >>> > LAPACK routine that's used:
> >> >>>> >>> >
> >> >>>> >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements
> >> >>>> >>> > This PR also had some lstsq precision issues, so could be
> >> >>>> >>> > useful:
> >> >>>> >>> > https://github.com/scipy/scipy/pull/5550
> >> >>>> >>> >
> >> >>>> >>> > What is the accuracy difference in your case?
> >> >>>> >>>
> >> >>>> >>> Damn - I thought you might ask me that!
> >> >>>> >>>
> >> >>>> >>> The test where it showed up comes from some calculations on a
> >> >>>> >>> simple regression problem.
> >> >>>> >>>
> >> >>>> >>> The regression model is $y = X B + e$, with the ordinary least
> >> >>>> >>> squares fit given by $B = X^+ y$, where $X^+$ is the
> >> >>>> >>> pseudoinverse of the matrix X.
> >> >>>> >>>
> >> >>>> >>> For my particular X, $\sum e^2$ (SSE) is smaller using the
> >> >>>> >>> scipy 0.16 pinv, compared to the scipy 0.17 pinv:
> >> >>>> >>>
> >> >>>> >>> # SSE using high-precision sympy pinv
> >> >>>> >>> 72.0232781366615313
> >> >>>> >>> # SSE using scipy 0.16 pinv
> >> >>>> >>> 72.0232781366625261
> >> >>>> >>> # SSE using scipy 0.17 pinv
> >> >>>> >>> 72.0232781390480170
> >> >>>> >>
> >> >>>> >>
> >> >>>> >> That difference looks pretty big, almost like a cutoff effect.
> >> >>>> >> What is the matrix X, in particular, what are its condition
> >> >>>> >> number and singular values? Might also be interesting to see
> >> >>>> >> residuals for the different functions.
> >> >>>> >>
> >> >>>> >> <snip>
> >> >>>> >>
> >> >>>> >
> >> >>>> > Note that this might be an indication that the result of the
> >> >>>> > calculation has an intrinsically large error bar.
> >> >>>>
> >> >>>> In [6]: np.linalg.svd(X, compute_uv=False)
> >> >>>> Out[6]:
> >> >>>> array([  6.97301629e+07,   5.78702921e+05,   5.39646908e+05,
> >> >>>>          2.99383626e+05,   3.24329208e+04,   1.29520746e+02,
> >> >>>>          3.16197688e+01,   2.66011711e+00,   2.18287258e-01,
> >> >>>>          1.60105276e-01,   1.11807228e-01,   2.12694535e-03,
> >> >>>>          1.72675581e-03,   9.51576795e-04])
> >> >>>>
> >> >>>> In [7]: np.linalg.cond(X)
> >> >>>> Out[7]: 73278544935.300507
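> >> >>>>
> >> >>>> For the 2-norm, the condition number is just the ratio of the
> >> >>>> largest to the smallest singular value, so a quick sanity check,
> >> >>>> with X as above:
> >> >>>>
> >> >>>> import numpy as np
> >> >>>>
> >> >>>> s = np.linalg.svd(X, compute_uv=False)
> >> >>>> # ratio of extreme singular values reproduces np.linalg.cond(X):
> >> >>>> # roughly 6.97e+07 / 9.52e-04, i.e. about 7.3e+10
> >> >>>> print(s[0] / s[-1])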
> >> >>>
> >> >>>
> >> >>> Condition number is about 7e10, so you might expect to lose about
> >> >>> 10 digits of accuracy just due to roundoff.  The covariance of the
> >> >>> fitting error will be `inv(X.T @ X) * error`, so you can look at
> >> >>> the diagonals for the variance of the fitted parameters just for
> >> >>> kicks, taking `error` to be roundoff error or measurement error.
> >> >>> This is not to say that the results should not be more
> >> >>> reproducible, but it should be of interest for the validity of the
> >> >>> solution. Note that I suspect that there may be a difference in the
> >> >>> default `rcond` used, or maybe the matrix is on the edge and a
> >> >>> small change in results changes the effective rank.  You can run
> >> >>> with the `return_rank=True` argument to check that. Looks like
> >> >>> machine precision is used for the default value of rcond, so
> >> >>> experimenting with that probably won't be informative, but you
> >> >>> might want to try it anyway.
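> >> >>>
> >> >>> A minimal sketch of that check, with X as above (sigma2 is a
> >> >>> stand-in for whatever error variance you choose, not a measured
> >> >>> value):
> >> >>>
> >> >>> import numpy as np
> >> >>> from scipy import linalg
> >> >>>
> >> >>> # pseudoinverse plus the effective rank, to see whether the
> >> >>> # truncation point differs between 0.16 and 0.17
> >> >>> Xp, rank = linalg.pinv(X, return_rank=True)
> >> >>>
> >> >>> # rough parameter covariance; pinv rather than inv because
> >> >>> # X.T @ X is close to singular here
> >> >>> sigma2 = 1e-16  # placeholder error variance
> >> >>> cov = linalg.pinv(X.T @ X) * sigma2
> >> >>> print(rank, np.diag(cov))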
> >> >>
> >> >>
> >> >>
> >> >> statsmodels is using numpy pinv for the linear model and it works
> >> >> very well in almost all cases *).
> >> >>
> >> >> Using pinv with a singular (given the rcond threshold) X makes
> >> >> everything well defined and not subject to numerical noise, because
> >> >> of the regularization induced by pinv.  Our covariance for the
> >> >> parameter estimates, based on pinv(X), is also not noisy, but it is
> >> >> of reduced rank.
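> >> >>
> >> >> A tiny illustration of that regularization (hypothetical data, not
> >> >> the X from this thread):
> >> >>
> >> >> import numpy as np
> >> >>
> >> >> rng = np.random.RandomState(0)
> >> >> Z = rng.randn(20, 3)
> >> >> Zd = np.c_[Z, Z[:, -1]]  # duplicate a column: exactly singular
> >> >> y = Z @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(20)
> >> >>
> >> >> # pinv gives the well-defined minimum-norm solution: the duplicated
> >> >> # column's coefficient is split evenly between the two copies
> >> >> b = np.linalg.pinv(Zd, rcond=1e-10) @ y
> >> >> # and the covariance basis pinv(Zd.T @ Zd) has rank 3, not 4
> >> >> cov_basis = np.linalg.pinv(Zd.T @ Zd, rcond=1e-10)
> >> >> print(b, np.linalg.matrix_rank(cov_basis))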
> >> >
> >> >
> >> > What I'm thinking is that changing the algorithm, as here, is like
> >> > introducing noise relative to the original, so even if the result is
> >> > perfectly valid, the fitted parameters might change significantly. A
> >> > change in the 11th digit wouldn't surprise me at all, depending on
> >> > the actual X and rhs.
> >>
> >> I believe the errors $e = y - X X^+ y$ are in general robust to
> >> differences between valid pinv solutions, even in extreme cases:
> >>
> >> In [2]: print_scalar(sse(X, npl.pinv(X), y))
> >> 72.0232781366616024
> >>
> >> In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y))
> >> 72.0232781366939463
> >>
> >> In [4]: npl.cond(np.c_[X, X])
> >> Out[4]: 6.6247965335098062e+26
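> >>
> >> For reference, `sse` and `print_scalar` aren't defined in this
> >> thread; a minimal sketch of what they might look like:
> >>
> >> import numpy as np
> >>
> >> def sse(design, piv, data):
> >>     # sum of squared residuals with B = piv @ data (1-d data assumed)
> >>     resid = data - design @ (piv @ data)
> >>     return resid @ resid
> >>
> >> def print_scalar(x):
> >>     print('%.16f' % x)  # enough digits to show the differences above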
> >
> >
> >
> > I think this does not depend on the smallest singular values, which
> > should be regularized away in cases like this.
> > The potential problem is with singular values that are just a bit too
> > large for rcond to remove, but that still have numerical problems.
> >
> > I think it's a similar issue about the threshold as the one that you
> > raised some time ago about determining the matrix rank for
> > near-singular matrices.
>
> My point was (I think you're agreeing) that the accuracy of the pinv
> does not depend in a simple way on the condition number of the input
> matrix.
>

That's not the point I wanted to make.
The condition number doesn't matter if there is a singular value with a
relative size (rcond) of 1e-16 or 1e-30 or 1e-100.
However, the numerical accuracy might depend on the condition number when
there is a singular value with a relative size of 2e-15 or 5e-15.
I didn't manage to find the default cond or rcond for scipy.linalg.pinv and
lstsq. The problem I had was with numpy's pinv.
I think there is a grey zone of singular values, just above the rcond
cutoff, that might be bad numerically.

(My only experience is with SVD based pinv.)
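
To make the grey zone concrete: an SVD-based pinv is essentially a hard
threshold on the singular values. A sketch that mirrors numpy's cutoff
rule (rcond times the largest singular value):

import numpy as np

def pinv_svd(a, rcond=1e-15):
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    cutoff = rcond * s.max()
    # invert singular values above the cutoff and drop the rest;
    # a value just above the cutoff turns into a huge 1/s factor
    # that amplifies noise, just below it is removed entirely
    s_inv = np.where(s > cutoff, 1.0 / s, 0.0)
    return vt.T @ (s_inv[:, None] * u.T)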

Josef


>
> I'm not quite sure how to assess the change in accuracy of scipy 0.17
> on Linux, but it does seem to me to be significantly different from
> (and worse than) the other options (like numpy, scipy 0.16).
>
> Matthew