[SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected?

Matthew Brett matthew.brett at gmail.com
Fri Mar 18 15:07:39 EDT 2016


On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
> On Wed, Mar 16, 2016 at 9:39 PM, <josef.pktd at gmail.com> wrote:
>>
>>
>>
>> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris
>> <charlesr.harris at gmail.com> wrote:
>>>
>>>
>>>
>>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett <matthew.brett at gmail.com>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris
>>>> <charlesr.harris at gmail.com> wrote:
>>>> >
>>>> >
>>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris
>>>> > <charlesr.harris at gmail.com> wrote:
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett
>>>> >> <matthew.brett at gmail.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers
>>>> >>> <ralf.gommers at gmail.com>
>>>> >>> wrote:
>>>> >>> >
>>>> >>> >
>>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett
>>>> >>> > <matthew.brett at gmail.com>
>>>> >>> > wrote:
>>>> >>> >>
>>>> >>> >> Hi,
>>>> >>> >>
>>>> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy
>>>> >>> >> 0.17.0:
>>>> >>> >>
>>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563
>>>> >>> >>
>>>> >>> >> This turns out to be due to small changes in the result of
>>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 on the same
>>>> >>> >> platform (the Travis Ubuntu, in this case).
>>>> >>> >>
>>>> >>> >> Is that expected?  I haven't investigated deeply, but I believe
>>>> >>> >> the
>>>> >>> >> 0.16.1 result (back into the depths of time) is slightly more
>>>> >>> >> accurate.  Is that possible?
>>>> >>> >
>>>> >>> >
>>>> >>> > That's quite possible, because we changed the underlying
>>>> >>> > LAPACK routine that's used:
>>>> >>> >
>>>> >>> >
>>>> >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements.
>>>> >>> > This PR also had some lstsq precision issues, so could be useful:
>>>> >>> > https://github.com/scipy/scipy/pull/5550
>>>> >>> >
>>>> >>> > What is the accuracy difference in your case?
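
A minimal sketch for comparing the two code paths, assuming the
`lapack_driver` keyword that lstsq gained in 0.17 (per the release
notes linked above); the X and y here are random stand-ins, not the
matrix from the failing test:

    import numpy as np
    from scipy.linalg import lstsq

    rng = np.random.RandomState(0)
    X = rng.rand(100, 14)                # stand-in design matrix
    y = rng.rand(100)                    # stand-in data vector

    for driver in ('gelss', 'gelsd'):    # 0.16 default vs 0.17 default
        B, res, rank, s = lstsq(X, y, lapack_driver=driver)
        print(driver, rank, np.sum((y - X.dot(B)) ** 2))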
>>>> >>>
>>>> >>> Damn - I thought you might ask me that!
>>>> >>>
>>>> >>> The test where it showed up comes from some calculations on a simple
>>>> >>> regression problem.
>>>> >>>
>>>> >>> The regression model is $y = X B + e$, with ordinary least squares
>>>> >>> fit
>>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X.
>>>> >>>
>>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16
>>>> >>> pinv, compared to scipy 0.17 pinv:
>>>> >>>
>>>> >>> # SSE using high-precision sympy pinv
>>>> >>> 72.0232781366615313
>>>> >>> # SSE using scipy 0.16 pinv
>>>> >>> 72.0232781366625261
>>>> >>> # SSE using scipy 0.17 pinv
>>>> >>> 72.0232781390480170
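
Roughly how such an SSE comparison can be computed; this is a sketch,
and the bodies of `sse` and `print_scalar` are my guess at the helpers
used in the session further down:

    import numpy as np

    def print_scalar(x):
        # print with enough digits for the comparison to be meaningful
        print('%.16f' % x)

    def sse(X, pinv_X, y):
        B = pinv_X.dot(y)        # OLS estimate, B = pinv(X) @ y
        e = y - X.dot(B)         # residuals
        return e.dot(e)          # sum of squared errors (SSE)

    # e.g. print_scalar(sse(X, np.linalg.pinv(X), y)) under 0.16 vs 0.17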
>>>> >>
>>>> >>
>>>> >> That difference looks pretty big; it looks almost like a cutoff
>>>> >> effect. What is the matrix X? In particular, what are its condition
>>>> >> number and singular values? It might also be interesting to see the
>>>> >> residuals for the different functions.
>>>> >>
>>>> >> <snip>
>>>> >>
>>>> >
>>>> > Note that this might be an indication that the result of the
>>>> > calculation has
>>>> > an intrinsically large error bar.
>>>>
>>>> In [6]: np.linalg.svd(X, compute_uv=False)
>>>> Out[6]:
>>>> array([  6.97301629e+07,   5.78702921e+05,   5.39646908e+05,
>>>>          2.99383626e+05,   3.24329208e+04,   1.29520746e+02,
>>>>          3.16197688e+01,   2.66011711e+00,   2.18287258e-01,
>>>>          1.60105276e-01,   1.11807228e-01,   2.12694535e-03,
>>>>          1.72675581e-03,   9.51576795e-04])
>>>>
>>>> In [7]: np.linalg.cond(X)
>>>> Out[7]: 73278544935.300507
>>>
>>>
>>> Condition number is about 7e10, so you might expect to lose about 10
>>> digits of accuracy just due to roundoff.  The covariance of the fitted
>>> parameters will be `inv(X.T @ X) * error`, so you can look at the
>>> diagonals for the variance of the fitted parameters just for kicks,
>>> taking `error` to be roundoff error or measurement error. This is not
>>> to say that the results shouldn't be more reproducible, but it is
>>> relevant to the validity of the solution. Note that I suspect there may
>>> be a difference in the default `rcond` used, or maybe the matrix is on
>>> the edge and a small change in results changes the effective rank.  You
>>> can run with the `return_rank=True` argument to check that. It looks
>>> like machine precision is used for the default value of rcond, so
>>> experimenting with that probably won't be informative, but you might
>>> want to try it anyway.
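
A sketch of those two checks, assuming the 0.17-era `scipy.linalg.pinv`
signature (`cond` and `return_rank` keywords); X stands in for the
design matrix under study:

    import numpy as np
    from scipy.linalg import inv, pinv

    X = np.random.RandomState(0).rand(100, 14)   # stand-in design matrix

    pinv_X, rank = pinv(X, return_rank=True)     # default cutoff near eps
    print('effective rank:', rank)

    for cond in (1e-15, 1e-12, 1e-9):            # is the rank on the edge?
        _, r = pinv(X, cond=cond, return_rank=True)
        print(cond, r)

    # variance of the fitted parameters, taking sigma2 as the assumed
    # roundoff or measurement error variance
    sigma2 = 1.0
    var_params = np.diag(inv(X.T.dot(X))) * sigma2
    print(var_params)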
>>
>>
>>
>> statsmodels uses numpy pinv for the linear model, and it works very
>> well in almost all cases *).
>>
>> Using pinv with an X that is singular (given the rcond threshold) makes
>> everything well defined and not subject to numerical noise, because of
>> the regularization induced by pinv.  Our covariance for the parameter
>> estimates, based on pinv(X), is also not noisy, but it is of reduced
>> rank.
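
One way to see the reduced-rank point (a sketch in my own notation, not
statsmodels code): with B = pinv(X) @ y, the parameter covariance is
sigma2 * pinv(X) @ pinv(X).T, and its rank equals the number of singular
values that pinv keeps:

    import numpy as np

    rng = np.random.RandomState(0)
    A = rng.rand(100, 7)
    X = np.c_[A, A]                  # stand-in: exactly rank-deficient

    pinv_X = np.linalg.pinv(X)       # rcond cutoff drops the zero values
    sigma2 = 1.0                     # placeholder noise variance
    cov_B = sigma2 * pinv_X.dot(pinv_X.T)
    print(X.shape[1], np.linalg.matrix_rank(cov_B))   # 14 columns, rank 7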
>
>
> What I'm thinking is that changing the algorithm, as here, is like
> introducing noise relative to the original, so even if the result is
> perfectly valid, the fitted parameters might change significantly. A
> change in the 11th digit wouldn't surprise me at all, depending on the
> actual X and rhs.

I believe the errors $e = y - X X^+ y$ are in general robust to
differences between valid pinv solutions, even in extreme cases:

In [2]: print_scalar(sse(X, npl.pinv(X), y))
72.0232781366616024

In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y))
72.0232781366939463

In [4]: npl.cond(np.c_[X, X])
Out[4]: 6.6247965335098062e+26
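
The reason, I think, is that X @ pinv(X) is the orthogonal projector
onto the column space of X, and np.c_[X, X] has the same column space,
so the residual y - X @ B is the same projection of y either way, up to
roundoff and the rcond cutoff. A quick numerical check of that claim
(with a random stand-in X):

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.rand(100, 14)              # stand-in design matrix
    P1 = X.dot(np.linalg.pinv(X))      # projector from X
    XX = np.c_[X, X]
    P2 = XX.dot(np.linalg.pinv(XX))    # projector from duplicated columns
    print(np.max(np.abs(P1 - P2)))     # should be roundoff-small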

Cheers,

Matthew


