[SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected?

Charles R Harris charlesr.harris at gmail.com
Wed Mar 16 23:49:29 EDT 2016


On Wed, Mar 16, 2016 at 9:39 PM, <josef.pktd at gmail.com> wrote:

>
>
> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris <
> charlesr.harris at gmail.com> wrote:
>
>>
>>
>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett <matthew.brett at gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris
>>> <charlesr.harris at gmail.com> wrote:
>>> >
>>> >
>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris
>>> > <charlesr.harris at gmail.com> wrote:
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett <
>>> matthew.brett at gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers <
>>> ralf.gommers at gmail.com>
>>> >>> wrote:
>>> >>> >
>>> >>> >
>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett
>>> >>> > <matthew.brett at gmail.com>
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> Hi,
>>> >>> >>
>>> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy
>>> >>> >> 0.17.0:
>>> >>> >>
>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563
>>> >>> >>
>>> >>> >> This turns out to be due to small changes in the result of
>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same
>>> >>> >> platform - the travis Ubuntu in this case.
>>> >>> >>
>>> >>> >> Is that expected?  I haven't investigated deeply, but I believe
>>> >>> >> the 0.16.1 result (back into the depths of time) is slightly
>>> >>> >> more accurate.  Is that possible?
>>> >>> >
>>> >>> >
>>> >>> > That's quite possible, because we changed the underlying LAPACK
>>> >>> > routine that's used:
>>> >>> >
>>> >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements
>>> >>> > This PR also had some lstsq precision issues, so could be useful:
>>> >>> > https://github.com/scipy/scipy/pull/5550
>>> >>> >
>>> >>> > What is the accuracy difference in your case?
>>> >>>
>>> >>> Damn - I thought you might ask me that!
>>> >>>
>>> >>> The test where it showed up comes from some calculations on a simple
>>> >>> regression problem.
>>> >>>
>>> >>> The regression model is $y = X B + e$, with the ordinary least
>>> >>> squares fit given by $B = X^+ y$, where $X^+$ is the pseudoinverse
>>> >>> of the matrix X.
>>> >>>
>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16
>>> >>> pinv, compared to scipy 0.17 pinv:
>>> >>>
>>> >>> # SSE using high-precision sympy pinv
>>> >>> 72.0232781366615313
>>> >>> # SSE using scipy 0.16 pinv
>>> >>> 72.0232781366625261
>>> >>> # SSE using scipy 0.17 pinv
>>> >>> 72.0232781390480170
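The comparison above can be sketched as follows. The design matrix here is a random stand-in (the thread's actual 14-column neuroimaging X is not reproduced), so the numbers differ, but the quantity being compared is the same SSE:

```python
import numpy as np
from scipy import linalg

# Stand-in design matrix and response; the thread's actual X is a
# 14-column neuroimaging design and is not reproduced here.
rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = rng.randn(50)

# OLS fit via the pseudoinverse: B = X^+ y
B = linalg.pinv(X) @ y
e = y - X @ B
sse = np.sum(e ** 2)  # the SSE quantity whose digits differ across versions

# The same fit via lstsq, whose LAPACK backend changed in scipy 0.17
B2, _, rank, sv = linalg.lstsq(X, y)
```

For a well-conditioned X like this one, the pinv and lstsq solutions agree to near machine precision; the thread's disagreement only shows up because its X is severely ill conditioned.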
>>> >>
>>> >>
>>> >> That difference looks pretty big; almost like a cutoff effect.
>>> >> What is the matrix X, in particular, what is its condition number
>>> >> or singular values? Might also be interesting to see residuals for
>>> >> the different functions.
>>> >>
>>> >> <snip>
>>> >>
>>> >
>>> > Note that this might be an indication that the result of the
>>> > calculation has an intrinsically large error bar.
>>>
>>> In [6]: np.linalg.svd(X, compute_uv=False)
>>> Out[6]:
>>> array([  6.97301629e+07,   5.78702921e+05,   5.39646908e+05,
>>>          2.99383626e+05,   3.24329208e+04,   1.29520746e+02,
>>>          3.16197688e+01,   2.66011711e+00,   2.18287258e-01,
>>>          1.60105276e-01,   1.11807228e-01,   2.12694535e-03,
>>>          1.72675581e-03,   9.51576795e-04])
>>>
>>> In [7]: np.linalg.cond(X)
>>> Out[7]: 73278544935.300507
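A rough rule of thumb for reading that condition number: in double precision (about 16 significant digits) one expects to lose roughly log10(cond(X)) digits to roundoff, which for this X leaves only about five reliable digits:

```python
import numpy as np

# cond(X) as reported above
cond = 73278544935.300507

digits_lost = np.log10(cond)    # roughly 10.9 digits lost to roundoff
digits_left = 16 - digits_lost  # roughly 5 reliable digits remain
```

That is consistent with the SSE values above agreeing to only about 10-11 significant digits between the two scipy versions.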
>>>
>>
>> Condition number is about 7e10, so you might expect to lose about 10
>> digits of accuracy just due to roundoff.  The covariance of the fitting
>> error will be `inv(X.T @ X) * error`, so you can look at the diagonals
>> for the variance of the fitted parameters just for kicks, taking
>> `error` to be roundoff error or measurement error. This is not to say
>> that the results should not be more reproducible, but it should be of
>> interest for the validity of the solution. Note that I suspect that
>> there may be a difference in the default `rcond` used, or maybe the
>> matrix is on the edge and a small change in results changes the
>> effective rank.  You can run with the `return_rank=True` argument to
>> check that. Looks like machine precision is used for the default value
>> of rcond, so experimenting with that probably won't be informative, but
>> you might want to try it anyway.
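The rank check suggested here can be sketched with `scipy.linalg.pinv`, which accepts `return_rank=True`. The matrix below is a deliberately ill-conditioned stand-in, not the thread's X, and the `rtol` keyword (the modern spelling of the cutoff, assuming scipy >= 1.7) is used to show the effective rank moving with the threshold:

```python
import numpy as np
from scipy import linalg

# Deliberately ill-conditioned stand-in design (badly scaled polynomial
# terms), not the thread's matrix.
X = np.vander(np.linspace(1.0, 2.0, 12), 8)

# Effective rank at the default cutoff (near machine precision):
# full rank here, since cond(X) is well below 1/eps.
Xp, rank = linalg.pinv(X, return_rank=True)

# Raising the cutoff can drop the effective rank; a matrix sitting on
# that edge is the situation suspected above.
Xp2, rank2 = linalg.pinv(X, rtol=1e-6, return_rank=True)
```

If the reported rank changes between versions (or with small perturbations of `rtol`), that points at the cutoff effect rather than at the backend change itself.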
>>
>
>
> statsmodels uses numpy pinv for the linear model, and it works very
> well in almost all cases *).
>
> Using pinv with X that is singular (given the rcond threshold) makes
> everything well defined and not subject to numerical noise, because of
> the regularization induced by pinv.  Our covariance for the parameter
> estimates, based on pinv(X), is also not noisy, but it is of reduced
> rank.
>
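A small sketch of the regularization described here, assuming an exactly collinear column so that pinv's singular-value truncation is visible (the data are made up for illustration):

```python
import numpy as np

# Exactly singular design: the third column is twice the second.
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x, 2.0 * x])
y = 3.0 + 0.5 * x

# pinv truncates the zero singular value (default rcond ~ 1e-15 relative
# to the largest singular value), so the fit is well defined ...
Xp = np.linalg.pinv(X)
B = Xp @ y                  # minimum-norm solution among the OLS solutions

# ... but the (unscaled) parameter covariance based on pinv(X) has
# reduced rank: 2 instead of 3.
cov_unscaled = Xp @ Xp.T
rank = np.linalg.matrix_rank(cov_unscaled)
```

The fitted values are exact because y lies in the column space of X; only the split of the coefficient between the two collinear columns is pinned down by the minimum-norm convention.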

What I'm thinking is that changing the algorithm, as here, is like
introducing noise relative to the original, so even if the result is
perfectly valid, the fitted parameters might change significantly. A change
in the 11th digit wouldn't surprise me at all, depending on the actual X
and rhs.


>
> The main problem I have seen so far is that the default rcond is too
> small for some cases: it breaks down with near-singular X unless rcond
> is set a bit larger than 1e-15, as in an example with dummy variables.
>
> https://github.com/statsmodels/statsmodels/issues/2628
>
> http://stackoverflow.com/questions/32599159/robustness-issue-of-statsmodel-linear-regression-ols-python
>
>
> If I think a bit longer, I remember two other cases where pinv with
> ill-conditioned X also doesn't work very well with continuous, almost
> collinear data: one NIST example with badly scaled polynomials, where
> precision drops to a few decimals, and a very small economic dataset
> where the normal equations (or orthogonality conditions?) are not
> satisfied at good precision.
>
>
Yeah, it can make a real mess with polynomial fits ;)

Chuck