From gfyoung17 at gmail.com Thu Mar 3 05:01:30 2016 From: gfyoung17 at gmail.com (G Young) Date: Thu, 3 Mar 2016 10:01:30 +0000 Subject: [SciPy-Dev] "invalid" matrices in signaltools.py Message-ID: Hello all, I have a PR up here that allows functions like "convolve" and "correlate" to accept matrices in either order for the "valid" order. People seem to be in general agreement that either order should be accepted. The question is how to handle inputs where one matrix is not strictly larger than the other (i.e. the dimensions of one matrix are not all at least as large as the analogous dimension of another). My proposal is to let such arguments pass through, as the function will always return the empty matrix as expected. However, others are more in favor of throwing an Exception. What are people's thoughts? Thanks! Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Thu Mar 3 05:26:57 2016 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Thu, 03 Mar 2016 11:26:57 +0100 Subject: [SciPy-Dev] "invalid" matrices in signaltools.py In-Reply-To: References: Message-ID: <1457000817.26057.9.camel@sipsolutions.net> On Do, 2016-03-03 at 10:01 +0000, G Young wrote: > Hello all, > > I have a PR up here that allows functions like "convolve" and > "correlate" to accept matrices in either order for the "valid" order. > People seem to be in general agreement that either order should be > accepted. The question is how to handle inputs where one matrix is > not strictly larger than the other (i.e. the dimensions of one matrix > are not all at least as large as the analogous dimension of another). > > My proposal is to let such arguments pass through, as the function > will always return the empty matrix as expected. However, others are > more in favor of throwing an Exception. What are people's thoughts? > Not really my thing, but my gut feeling would be an exception as well; unless someone can think of an example where it might be useful (i.e. for empty arrays you can often construct examples by thinking of data stream chunks or groups which happen to have 0 items in them). The empty array in general does seem to loose information though. After all, you do not have an empty array of shape (3, -2), but of shape (3, 0), and even then since "valid" mode switches which array is the larger one, should it be (3, 0) or maybe (0, 2)? - Sebastian > Thanks! > > Greg > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: This is a digitally signed message part URL: From evgeny.burovskiy at gmail.com Thu Mar 3 11:59:23 2016 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Thu, 3 Mar 2016 16:59:23 +0000 Subject: [SciPy-Dev] Fwd: [Numpy-discussion] GSoC? In-Reply-To: References: Message-ID: ---------- Forwarded message ---------- From: Ralf Gommers Date: Wed, Feb 10, 2016 at 11:02 PM Subject: Re: [Numpy-discussion] GSoC? To: Discussion of Numerical Python On Wed, Feb 10, 2016 at 11:55 PM, Ralf Gommers wrote: > > > > On Wed, Feb 10, 2016 at 11:48 PM, Chris Barker wrote: >> >> Thanks Ralf, >> >>> Note that we have always done a combined numpy/scipy ideas page and submission. For really good students numpy may be the right challenge, but in general scipy is easier to get started on. 
>> >> >> yup -- good idea. Is there a page ready to go, or do we need to get one up? (I don't even know where to put it...) > > > This is last year's page: https://github.com/scipy/scipy/wiki/GSoC-2015-project-ideas > > Some ideas have been worked on, others are still relevant. Let's copy this page to -2016- and start editing it and adding new ideas. I'll start right now actually. OK first version: https://github.com/scipy/scipy/wiki/GSoC-2016-project-ideas I kept some of the ideas from last year, but removed all potential mentors as the same people may not be available this year - please re-add yourselves where needed. And to everyone who has a good idea, and preferably is willing to mentor for that idea: please add it to that page. Here are a couple of vague ideas which I'd like to hash out on the list first. Both are mostly in the request-for-comment, feedback-welcome stage. (I am going to write them out even though I cannot guarantee that I'll be available to mentor either of them.) 1. Numeric differentiation. The current proposal focuses on bundling numdifftools. I think there's room for a complementary set of routines for numeric differentiation with fixed step, or limited intelligence w.r.t. the step selection. In fact, there is a fairly comprehensive implementation in scipy already, a by-product of last year's GSoC. It's not exposed publicly, and is hidden as scipy.optimize._numdiff.approx_derivative. I suspect that what's already there covers the bulk of the implementation, and what's left is a only a bit of work to spec out the user-facing API, especially for the broadcasting behavior, and then replacing semi-broken bits of scipy.misc.derivative, scipy.optimize.approx_fprime etc. The original PR has quite extensive discussion, https://github.com/scipy/scipy/pull/4884 And here's an unfinished attempt at following up on that discussion: https://gist.github.com/ev-br/7a1edb3f250bd375c46a 2. Optimization benchmarks. There are two quite large bodies of work on benchmarking various optimization algorithms: - Andrew Nelson's global optimizers benchmarks, https://github.com/scipy/scipy/pull/4191. - Nikolay Mayorov's nonliear LSQ benchmarks, some of which are available in scipy/benchmarks, and the most recent version is at https://github.com/nmayorov/lsq-benchmarks I think both can be very valuable for a broader scipy ecosystem, and both of these herculean efforts are currently blocked on the question of how to actually run them. It seems to me that it would be a useful project to come up with a standardized way of running these sorts of benchmarks and publishing the results. Ideally, it would include, * a runner. Maybe an ASV-based one, but then the project might include some enhancements to asv itself. * a way of pushlishing the results (again, asv has one, is it sufficient?) * a self-contained environment for running and publishing (a VM image? a docker container?) So that by the end of the project we would have a web page with the reference results, some automation for refreshing when needed, and a documented way of reproducing them. Thoughts? Evgeni From matthew.brett at gmail.com Sat Mar 12 00:16:12 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 11 Mar 2016 21:16:12 -0800 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
Message-ID: Hi, I just noticed that our (neuroimaging) test suite fails for scipy 0.17.0: https://travis-ci.org/matthew-brett/nipy/jobs/115471563 This turns out to be due to small changes in the result of `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform - the travis Ubuntu in this case. Is that expected? I haven't investigated deeply, but I believe the 0.16.1 result (back into the depths of time) is slightly more accurate. Is that possible? Thanks for any insight, Matthew From ralf.gommers at gmail.com Sun Mar 13 06:43:18 2016 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sun, 13 Mar 2016 11:43:18 +0100 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett wrote: > Hi, > > I just noticed that our (neuroimaging) test suite fails for scipy 0.17.0: > > https://travis-ci.org/matthew-brett/nipy/jobs/115471563 > > This turns out to be due to small changes in the result of > `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform > - the travis Ubuntu in this case. > > Is that expected? I haven't investigated deeply, but I believe the > 0.16.1 result (back into the depths of time) is slightly more > accurate. Is that possible? > That's very well possible, because we changed the underlying LAPACK routine that's used: https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. This PR also had some lstsq precision issues, so could be useful: https://github.com/scipy/scipy/pull/5550 What is the accuracy difference in your case? Ralf > > Thanks for any insight, > > Matthew > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From x3n0cr4735 at gmail.com Sun Mar 13 16:28:28 2016 From: x3n0cr4735 at gmail.com (Jan-Samuel Wagner) Date: Sun, 13 Mar 2016 16:28:28 -0400 Subject: [SciPy-Dev] Gaussian Process hyperparameter tuning Message-ID: Is anyone busy developing a Gaussian Process strategy for hype-parameter tuning? Seems strange there is only GridSearchCV and RandomSearchCV methods for sklearn. Please let me know. I would love to develop this feature. -------------- next part -------------- An HTML attachment was scrubbed... URL: From davidmenhur at gmail.com Sun Mar 13 17:45:25 2016 From: davidmenhur at gmail.com (=?UTF-8?B?RGHPgGlk?=) Date: Sun, 13 Mar 2016 22:45:25 +0100 Subject: [SciPy-Dev] Gaussian Process hyperparameter tuning In-Reply-To: References: Message-ID: On 13 March 2016 at 21:28, Jan-Samuel Wagner wrote: > Is anyone busy developing a Gaussian Process strategy for hype-parameter > tuning? Seems strange there is only GridSearchCV and RandomSearchCV methods > for sklearn. The GPy package is pretty feature rich, and more complete than what is in sklearn. https://gpy.readthedocs.org/en/master/ They have implemented a gradient descent and a minibatch, and also many different kernels, sparse GP, etc. 
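For a sense of what that looks like in practice, here is a minimal sketch of fitting a GP and optimizing its kernel hyperparameters with GPy (the toy data and RBF kernel are only illustrative, and the calls assume a recent GPy release):

import numpy as np
import GPy

# Toy 1-D regression problem (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(-3., 3., (20, 1))
y = np.sin(X) + 0.05 * rng.randn(20, 1)

# RBF kernel: variance and lengthscale are the hyperparameters to tune.
kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
model = GPy.models.GPRegression(X, y, kernel)

# Gradient-based maximization of the log marginal likelihood.
model.optimize(messages=False)
print(model)  # optimized variance, lengthscale and Gaussian noise

mean, var = model.predict(np.linspace(-3., 3., 50)[:, None])

An optimized GP surrogate like this is the building block that a Bayesian-optimization style tuner would place on top of an estimator's cross-validation score.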
From gael.varoquaux at normalesup.org Sun Mar 13 18:37:54 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sun, 13 Mar 2016 23:37:54 +0100 Subject: [SciPy-Dev] Gaussian Process hyperparameter tuning In-Reply-To: References: Message-ID: <20160313223754.GC474178@phare.normalesup.org> On Sun, Mar 13, 2016 at 04:28:28PM -0400, Jan-Samuel Wagner wrote: > Is anyone busy developing a Gaussian Process strategy for hype-parameter > tuning? Seems strange there is only GridSearchCV and RandomSearchCV methods for > sklearn. Such question is probably better asked on the scikit-learn mailing list: https://lists.sourceforge.net/lists/listinfo/scikit-learn-general This feature is being worked upon: https://github.com/scikit-learn/scikit-learn/pull/5491 Cheers, Ga?l From matthew.brett at gmail.com Wed Mar 16 20:05:03 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 16 Mar 2016 17:05:03 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: Hi, On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers wrote: > > > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > wrote: >> >> Hi, >> >> I just noticed that our (neuroimaging) test suite fails for scipy 0.17.0: >> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >> >> This turns out to be due to small changes in the result of >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform >> - the travis Ubuntu in this case. >> >> Is that expected? I haven't investigated deeply, but I believe the >> 0.16.1 result (back into the depths of time) is slightly more >> accurate. Is that possible? > > > That's very well possible, because we changed the underlying LAPACK routine > that's used: > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. > This PR also had some lstsq precision issues, so could be useful: > https://github.com/scipy/scipy/pull/5550 > > What is the accuracy difference in your case? Damn - I thought you might ask me that! The test where it showed up comes from some calculations on a simple regression problem. The regression model is $y = X B + e$, with ordinary least squares fit given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 pinv, compared to scipy 0.17 pinv: # SSE using high-precision sympy pinv 72.0232781366615313 # SSE using scipy 0.16 pinv 72.0232781366625261 # SSE using scipy 0.17 pinv 72.0232781390480170 See: https://github.com/matthew-brett/scipy-pinv/blob/master/pinv_futz.py A difference at the 9th decimal place - but this was enough to give a difference at the 5th decimal place in a more involved calculation using the same pinv result: https://github.com/nipy/nipy/pull/394/files Does that help? Cheers, Matthew From charlesr.harris at gmail.com Wed Mar 16 20:41:16 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 16 Mar 2016 18:41:16 -0600 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett wrote: > Hi, > > On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers > wrote: > > > > > > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > > wrote: > >> > >> Hi, > >> > >> I just noticed that our (neuroimaging) test suite fails for scipy > 0.17.0: > >> > >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 > >> > >> This turns out to be due to small changes in the result of > >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform > >> - the travis Ubuntu in this case. > >> > >> Is that expected? I haven't investigated deeply, but I believe the > >> 0.16.1 result (back into the depths of time) is slightly more > >> accurate. Is that possible? > > > > > > That's very well possible, because we changed the underlying LAPACK > routine > > that's used: > > > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements > . > > This PR also had some lstsq precision issues, so could be useful: > > https://github.com/scipy/scipy/pull/5550 > > > > What is the accuracy difference in your case? > > Damn - I thought you might ask me that! > > The test where it showed up comes from some calculations on a simple > regression problem. > > The regression model is $y = X B + e$, with ordinary least squares fit > given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. > > For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 > pinv, compared to scipy 0.17 pinv: > > # SSE using high-precision sympy pinv > 72.0232781366615313 > # SSE using scipy 0.16 pinv > 72.0232781366625261 > # SSE using scipy 0.17 pinv > 72.0232781390480170 > That difference looks pretty big, looks almost like a cutoff effect. What is the matrix X, in particular, what is its condition number or singular values? Might also be interesting to see residuals for the different functions. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Mar 16 20:48:26 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 16 Mar 2016 18:48:26 -0600 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris wrote: > > > On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett > wrote: > >> Hi, >> >> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >> wrote: >> > >> > >> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > > >> > wrote: >> >> >> >> Hi, >> >> >> >> I just noticed that our (neuroimaging) test suite fails for scipy >> 0.17.0: >> >> >> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >> >> >> >> This turns out to be due to small changes in the result of >> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform >> >> - the travis Ubuntu in this case. >> >> >> >> Is that expected? I haven't investigated deeply, but I believe the >> >> 0.16.1 result (back into the depths of time) is slightly more >> >> accurate. Is that possible? >> > >> > >> > That's very well possible, because we changed the underlying LAPACK >> routine >> > that's used: >> > >> https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements >> . >> > This PR also had some lstsq precision issues, so could be useful: >> > https://github.com/scipy/scipy/pull/5550 >> > >> > What is the accuracy difference in your case? >> >> Damn - I thought you might ask me that! 
>> >> The test where it showed up comes from some calculations on a simple >> regression problem. >> >> The regression model is $y = X B + e$, with ordinary least squares fit >> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >> >> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >> pinv, compared to scipy 0.17 pinv: >> >> # SSE using high-precision sympy pinv >> 72.0232781366615313 >> # SSE using scipy 0.16 pinv >> 72.0232781366625261 >> # SSE using scipy 0.17 pinv >> 72.0232781390480170 >> > > That difference looks pretty big, looks almost like a cutoff effect. What > is the matrix X, in particular, what is its condition number or singular > values? Might also be interesting to see residuals for the different > functions. > > > > Note that this might be an indication that the result of the calculation has an intrinsically large error bar. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Wed Mar 16 21:16:58 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 16 Mar 2016 18:16:58 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: Hi, On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris wrote: > > > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris > wrote: >> >> >> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >> wrote: >>> >>> Hi, >>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >>> wrote: >>> > >>> > >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>> > >>> > wrote: >>> >> >>> >> Hi, >>> >> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy >>> >> 0.17.0: >>> >> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>> >> >>> >> This turns out to be due to small changes in the result of >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform >>> >> - the travis Ubuntu in this case. >>> >> >>> >> Is that expected? I haven't investigated deeply, but I believe the >>> >> 0.16.1 result (back into the depths of time) is slightly more >>> >> accurate. Is that possible? >>> > >>> > >>> > That's very well possible, because we changed the underlying LAPACK >>> > routine >>> > that's used: >>> > >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. >>> > This PR also had some lstsq precision issues, so could be useful: >>> > https://github.com/scipy/scipy/pull/5550 >>> > >>> > What is the accuracy difference in your case? >>> >>> Damn - I thought you might ask me that! >>> >>> The test where it showed up comes from some calculations on a simple >>> regression problem. >>> >>> The regression model is $y = X B + e$, with ordinary least squares fit >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >>> pinv, compared to scipy 0.17 pinv: >>> >>> # SSE using high-precision sympy pinv >>> 72.0232781366615313 >>> # SSE using scipy 0.16 pinv >>> 72.0232781366625261 >>> # SSE using scipy 0.17 pinv >>> 72.0232781390480170 >> >> >> That difference looks pretty big, looks almost like a cutoff effect. What >> is the matrix X, in particular, what is its condition number or singular >> values? Might also be interesting to see residuals for the different >> functions. >> >> >> > > Note that this might be an indication that the result of the calculation has > an intrinsically large error bar. 
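For reference, the quantity being compared above is the residual sum of squares of the pinv-based fit B = X^+ y. The sse and print_scalar helpers live in the linked gist; a plausible reconstruction of what they compute (illustrative only, not necessarily the gist's exact code) is:

import numpy as np
import scipy.linalg as spl

def sse(X, piX, y):
    # OLS fit via a supplied pseudoinverse: B = X^+ y, residuals e = y - X B.
    B = piX.dot(y)
    e = y - X.dot(B)
    return e.dot(e)

def print_scalar(v):
    # Enough digits to expose differences at the 9th to 16th decimal place.
    print('%.16f' % v)

# e.g. print_scalar(sse(X, spl.pinv(X), y)), with X and y taken from the gist.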
In [6]: np.linalg.svd(X, compute_uv=False) Out[6]: array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, 1.72675581e-03, 9.51576795e-04]) In [7]: np.linalg.cond(X) Out[7]: 73278544935.300507 I wasn't sure what you meant about residuals from different functions - can you explain more? Cheers, Matthew From matthew.brett at gmail.com Wed Mar 16 22:13:18 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 16 Mar 2016 19:13:18 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 6:16 PM, Matthew Brett wrote: > Hi, > > On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris > wrote: >> >> >> On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >> wrote: >>> >>> >>> >>> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >>> wrote: >>>> >>>> Hi, >>>> >>>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >>>> wrote: >>>> > >>>> > >>>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>>> > >>>> > wrote: >>>> >> >>>> >> Hi, >>>> >> >>>> >> I just noticed that our (neuroimaging) test suite fails for scipy >>>> >> 0.17.0: >>>> >> >>>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>>> >> >>>> >> This turns out to be due to small changes in the result of >>>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform >>>> >> - the travis Ubuntu in this case. >>>> >> >>>> >> Is that expected? I haven't investigated deeply, but I believe the >>>> >> 0.16.1 result (back into the depths of time) is slightly more >>>> >> accurate. Is that possible? >>>> > >>>> > >>>> > That's very well possible, because we changed the underlying LAPACK >>>> > routine >>>> > that's used: >>>> > >>>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. >>>> > This PR also had some lstsq precision issues, so could be useful: >>>> > https://github.com/scipy/scipy/pull/5550 >>>> > >>>> > What is the accuracy difference in your case? >>>> >>>> Damn - I thought you might ask me that! >>>> >>>> The test where it showed up comes from some calculations on a simple >>>> regression problem. >>>> >>>> The regression model is $y = X B + e$, with ordinary least squares fit >>>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >>>> >>>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >>>> pinv, compared to scipy 0.17 pinv: >>>> >>>> # SSE using high-precision sympy pinv >>>> 72.0232781366615313 >>>> # SSE using scipy 0.16 pinv >>>> 72.0232781366625261 >>>> # SSE using scipy 0.17 pinv >>>> 72.0232781390480170 >>> >>> >>> That difference looks pretty big, looks almost like a cutoff effect. What >>> is the matrix X, in particular, what is its condition number or singular >>> values? Might also be interesting to see residuals for the different >>> functions. >>> >>> >>> >> >> Note that this might be an indication that the result of the calculation has >> an intrinsically large error bar. 
> > In [6]: np.linalg.svd(X, compute_uv=False) > Out[6]: > array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, > 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, > 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, > 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, > 1.72675581e-03, 9.51576795e-04]) > > In [7]: np.linalg.cond(X) > Out[7]: 73278544935.300507 Although, the differences in residual error are at the 12th decimal place comparing the sum of squared error for a full-rank vs not-full-rank matrix: In [2]: import scipy.linalg as spl In [3]: X_redundant = np.c_[X, X[:, 0]] In [5]: np.linalg.cond(X_redundant) Out[5]: 4.6338774550388212e+23 In [8]: print_scalar(sse(X, spl.pinv(X), y)) 72.0232781366634640 In [9]: print_scalar(sse(X_redundant, spl.pinv(X_redundant), y)) 72.0232781366614461 Cheers, Matthew From charlesr.harris at gmail.com Wed Mar 16 23:01:35 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 16 Mar 2016 21:01:35 -0600 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett wrote: > Hi, > > On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris > wrote: > > > > > > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris > > wrote: > >> > >> > >> > >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett > > >> wrote: > >>> > >>> Hi, > >>> > >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers > >>> wrote: > >>> > > >>> > > >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > >>> > > >>> > wrote: > >>> >> > >>> >> Hi, > >>> >> > >>> >> I just noticed that our (neuroimaging) test suite fails for scipy > >>> >> 0.17.0: > >>> >> > >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 > >>> >> > >>> >> This turns out to be due to small changes in the result of > >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same > platform > >>> >> - the travis Ubuntu in this case. > >>> >> > >>> >> Is that expected? I haven't investigated deeply, but I believe the > >>> >> 0.16.1 result (back into the depths of time) is slightly more > >>> >> accurate. Is that possible? > >>> > > >>> > > >>> > That's very well possible, because we changed the underlying LAPACK > >>> > routine > >>> > that's used: > >>> > > >>> > > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements > . > >>> > This PR also had some lstsq precision issues, so could be useful: > >>> > https://github.com/scipy/scipy/pull/5550 > >>> > > >>> > What is the accuracy difference in your case? > >>> > >>> Damn - I thought you might ask me that! > >>> > >>> The test where it showed up comes from some calculations on a simple > >>> regression problem. > >>> > >>> The regression model is $y = X B + e$, with ordinary least squares fit > >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. > >>> > >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 > >>> pinv, compared to scipy 0.17 pinv: > >>> > >>> # SSE using high-precision sympy pinv > >>> 72.0232781366615313 > >>> # SSE using scipy 0.16 pinv > >>> 72.0232781366625261 > >>> # SSE using scipy 0.17 pinv > >>> 72.0232781390480170 > >> > >> > >> That difference looks pretty big, looks almost like a cutoff effect. > What > >> is the matrix X, in particular, what is its condition number or singular > >> values? Might also be interesting to see residuals for the different > >> functions. 
> >> > >> > >> > > > > Note that this might be an indication that the result of the calculation > has > > an intrinsically large error bar. > > In [6]: np.linalg.svd(X, compute_uv=False) > Out[6]: > array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, > 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, > 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, > 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, > 1.72675581e-03, 9.51576795e-04]) > > In [7]: np.linalg.cond(X) > Out[7]: 73278544935.300507 > Condition number is about 7e10, so you might expect to lose about 10 digits of of accuracy just due to roundoff. The covariance of the fitting error will be `inv(X.T, X) * error`, so you can look at the diagonals for the variance of the fitted parameters just for kicks, taking `error` to be roundoff error or measurement error. This is not to say that the results should not be more reproducible, but should be of interest for the validity of the solution. Note that I suspect that there may be a difference in the default `rcond` used, or maybe the matrix is on the edge and a small change in results changes the effective rank. You can run with the `return_rank=True` argument to check that. Looks like machine precision is used for the default value of rcond, so experimenting with that probable won't be informative, but you might want to try it anyway. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Wed Mar 16 23:39:29 2016 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Wed, 16 Mar 2016 23:39:29 -0400 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris < charlesr.harris at gmail.com> wrote: > > > On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett > wrote: > >> Hi, >> >> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >> wrote: >> > >> > >> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >> > wrote: >> >> >> >> >> >> >> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett < >> matthew.brett at gmail.com> >> >> wrote: >> >>> >> >>> Hi, >> >>> >> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers > > >> >>> wrote: >> >>> > >> >>> > >> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >> >>> > >> >>> > wrote: >> >>> >> >> >>> >> Hi, >> >>> >> >> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy >> >>> >> 0.17.0: >> >>> >> >> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >> >>> >> >> >>> >> This turns out to be due to small changes in the result of >> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same >> platform >> >>> >> - the travis Ubuntu in this case. >> >>> >> >> >>> >> Is that expected? I haven't investigated deeply, but I believe >> the >> >>> >> 0.16.1 result (back into the depths of time) is slightly more >> >>> >> accurate. Is that possible? >> >>> > >> >>> > >> >>> > That's very well possible, because we changed the underlying LAPACK >> >>> > routine >> >>> > that's used: >> >>> > >> >>> > >> https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements >> . >> >>> > This PR also had some lstsq precision issues, so could be useful: >> >>> > https://github.com/scipy/scipy/pull/5550 >> >>> > >> >>> > What is the accuracy difference in your case? >> >>> >> >>> Damn - I thought you might ask me that! >> >>> >> >>> The test where it showed up comes from some calculations on a simple >> >>> regression problem. 
>> >>> >> >>> The regression model is $y = X B + e$, with ordinary least squares fit >> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >> >>> >> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >> >>> pinv, compared to scipy 0.17 pinv: >> >>> >> >>> # SSE using high-precision sympy pinv >> >>> 72.0232781366615313 >> >>> # SSE using scipy 0.16 pinv >> >>> 72.0232781366625261 >> >>> # SSE using scipy 0.17 pinv >> >>> 72.0232781390480170 >> >> >> >> >> >> That difference looks pretty big, looks almost like a cutoff effect. >> What >> >> is the matrix X, in particular, what is its condition number or >> singular >> >> values? Might also be interesting to see residuals for the different >> >> functions. >> >> >> >> >> >> >> > >> > Note that this might be an indication that the result of the >> calculation has >> > an intrinsically large error bar. >> >> In [6]: np.linalg.svd(X, compute_uv=False) >> Out[6]: >> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >> 1.72675581e-03, 9.51576795e-04]) >> >> In [7]: np.linalg.cond(X) >> Out[7]: 73278544935.300507 >> > > Condition number is about 7e10, so you might expect to lose about 10 > digits of of accuracy just due to roundoff. The covariance of the fitting > error will be `inv(X.T, X) * error`, so you can look at the diagonals for > the variance of the fitted parameters just for kicks, taking `error` to be > roundoff error or measurement error. This is not to say that the results > should not be more reproducible, but should be of interest for the validity > of the solution. Note that I suspect that there may be a difference in the > default `rcond` used, or maybe the matrix is on the edge and a small change > in results changes the effective rank. You can run with the > `return_rank=True` argument to check that. Looks like machine precision is > used for the default value of rcond, so experimenting with that probable > won't be informative, but you might want to try it anyway. > statsmodels is using numpy pinv for the linear model and it works in almost all cases very well *). Using pinv with singular (given the rcond threshold) X makes everything well defined and not subject to numerical noise because of the regularization induced by pinv. Our covariance for the parameter estimates, based on pinv(X), is also not noisy but it's of reduced rank. The main problem I have seen so far is that the default rcond is too small for some cases, and it breaks down with near singular X with rcond a bit larger than 1e-15 in an example with dummy variables. https://github.com/statsmodels/statsmodels/issues/2628 http://stackoverflow.com/questions/32599159/robustness-issue-of-statsmodel-linear-regression-ols-python If I think a bit longer, then I remember two other cases where pinv with ill conditioned X also doesn't work very well with continuous almost collinear data, one NIST example with badly scaled polynomials where precision goes down to a few decimals and a very small economic dataset where the normal equations (or orthogonality conditions?) are not satisfied at a good precision.. Josef > > Chuck > > > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > > -------------- next part -------------- An HTML attachment was scrubbed... 
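To make the rcond point concrete, a small sketch (the near-collinear design matrix here is made up for illustration; it is not the matrix from this thread):

import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(50)
# Intercept plus two nearly collinear regressors.
X = np.column_stack([np.ones(50), x, x + 1e-9 * rng.randn(50)])
print(np.linalg.cond(X))  # very large condition number

# The default rcond (~1e-15) only drops singular values below 1e-15 * s_max,
# so the tiny, noise-dominated singular value is still inverted.
pinv_default = np.linalg.pinv(X)

# A larger rcond truncates that singular value, i.e. pinv regularizes the fit.
pinv_truncated = np.linalg.pinv(X, rcond=1e-8)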
URL: From charlesr.harris at gmail.com Wed Mar 16 23:49:29 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 16 Mar 2016 21:49:29 -0600 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 9:39 PM, wrote: > > > On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >> wrote: >> >>> Hi, >>> >>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >>> wrote: >>> > >>> > >>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >>> > wrote: >>> >> >>> >> >>> >> >>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett < >>> matthew.brett at gmail.com> >>> >> wrote: >>> >>> >>> >>> Hi, >>> >>> >>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers < >>> ralf.gommers at gmail.com> >>> >>> wrote: >>> >>> > >>> >>> > >>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>> >>> > >>> >>> > wrote: >>> >>> >> >>> >>> >> Hi, >>> >>> >> >>> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy >>> >>> >> 0.17.0: >>> >>> >> >>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>> >>> >> >>> >>> >> This turns out to be due to small changes in the result of >>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same >>> platform >>> >>> >> - the travis Ubuntu in this case. >>> >>> >> >>> >>> >> Is that expected? I haven't investigated deeply, but I believe >>> the >>> >>> >> 0.16.1 result (back into the depths of time) is slightly more >>> >>> >> accurate. Is that possible? >>> >>> > >>> >>> > >>> >>> > That's very well possible, because we changed the underlying LAPACK >>> >>> > routine >>> >>> > that's used: >>> >>> > >>> >>> > >>> https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements >>> . >>> >>> > This PR also had some lstsq precision issues, so could be useful: >>> >>> > https://github.com/scipy/scipy/pull/5550 >>> >>> > >>> >>> > What is the accuracy difference in your case? >>> >>> >>> >>> Damn - I thought you might ask me that! >>> >>> >>> >>> The test where it showed up comes from some calculations on a simple >>> >>> regression problem. >>> >>> >>> >>> The regression model is $y = X B + e$, with ordinary least squares >>> fit >>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >>> >>> >>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >>> >>> pinv, compared to scipy 0.17 pinv: >>> >>> >>> >>> # SSE using high-precision sympy pinv >>> >>> 72.0232781366615313 >>> >>> # SSE using scipy 0.16 pinv >>> >>> 72.0232781366625261 >>> >>> # SSE using scipy 0.17 pinv >>> >>> 72.0232781390480170 >>> >> >>> >> >>> >> That difference looks pretty big, looks almost like a cutoff effect. >>> What >>> >> is the matrix X, in particular, what is its condition number or >>> singular >>> >> values? Might also be interesting to see residuals for the different >>> >> functions. >>> >> >>> >> >>> >> >>> > >>> > Note that this might be an indication that the result of the >>> calculation has >>> > an intrinsically large error bar. 
>>> >>> In [6]: np.linalg.svd(X, compute_uv=False) >>> Out[6]: >>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >>> 1.72675581e-03, 9.51576795e-04]) >>> >>> In [7]: np.linalg.cond(X) >>> Out[7]: 73278544935.300507 >>> >> >> Condition number is about 7e10, so you might expect to lose about 10 >> digits of of accuracy just due to roundoff. The covariance of the fitting >> error will be `inv(X.T, X) * error`, so you can look at the diagonals for >> the variance of the fitted parameters just for kicks, taking `error` to be >> roundoff error or measurement error. This is not to say that the results >> should not be more reproducible, but should be of interest for the validity >> of the solution. Note that I suspect that there may be a difference in the >> default `rcond` used, or maybe the matrix is on the edge and a small change >> in results changes the effective rank. You can run with the >> `return_rank=True` argument to check that. Looks like machine precision is >> used for the default value of rcond, so experimenting with that probable >> won't be informative, but you might want to try it anyway. >> > > > statsmodels is using numpy pinv for the linear model and it works in > almost all cases very well *). > > Using pinv with singular (given the rcond threshold) X makes everything > well defined and not subject to numerical noise because of the > regularization induced by pinv. Our covariance for the parameter > estimates, based on pinv(X), is also not noisy but it's of reduced rank. > What I'm thinking is that changing the algorithm, as here, is like introducing noise relative to the original, so even if the result is perfectly valid, the fitted parameters might change significantly. A change in the 11-th digit wouldn't surprise me at all, depending on the actual X and rhs. > > The main problem I have seen so far is that the default rcond is too small > for some cases, and it breaks down with near singular X with rcond a bit > larger than 1e-15 in an example with dummy variables. > > https://github.com/statsmodels/statsmodels/issues/2628 > > http://stackoverflow.com/questions/32599159/robustness-issue-of-statsmodel-linear-regression-ols-python > > > If I think a bit longer, then I remember two other cases where pinv with > ill conditioned X also doesn't work very well with continuous almost > collinear data, one NIST example with badly scaled polynomials where > precision goes down to a few decimals and a very small economic dataset > where the normal equations (or orthogonality conditions?) are not satisfied > at a good precision.. > > Yeah, it can make a real mess with polynomial fits ;) Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Mar 16 23:57:22 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 16 Mar 2016 21:57:22 -0600 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 9:49 PM, Charles R Harris wrote: > > > On Wed, Mar 16, 2016 at 9:39 PM, wrote: > >> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >>> >>> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >>> wrote: >>> >>>> Hi, >>>> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >>>> wrote: >>>> > >>>> > >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >>>> > wrote: >>>> >> >>>> >> >>>> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett < >>>> matthew.brett at gmail.com> >>>> >> wrote: >>>> >>> >>>> >>> Hi, >>>> >>> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers < >>>> ralf.gommers at gmail.com> >>>> >>> wrote: >>>> >>> > >>>> >>> > >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>>> >>> > >>>> >>> > wrote: >>>> >>> >> >>>> >>> >> Hi, >>>> >>> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy >>>> >>> >> 0.17.0: >>>> >>> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>>> >>> >> >>>> >>> >> This turns out to be due to small changes in the result of >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same >>>> platform >>>> >>> >> - the travis Ubuntu in this case. >>>> >>> >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I believe >>>> the >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly more >>>> >>> >> accurate. Is that possible? >>>> >>> > >>>> >>> > >>>> >>> > That's very well possible, because we changed the underlying >>>> LAPACK >>>> >>> > routine >>>> >>> > that's used: >>>> >>> > >>>> >>> > >>>> https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements >>>> . >>>> >>> > This PR also had some lstsq precision issues, so could be useful: >>>> >>> > https://github.com/scipy/scipy/pull/5550 >>>> >>> > >>>> >>> > What is the accuracy difference in your case? >>>> >>> >>>> >>> Damn - I thought you might ask me that! >>>> >>> >>>> >>> The test where it showed up comes from some calculations on a simple >>>> >>> regression problem. >>>> >>> >>>> >>> The regression model is $y = X B + e$, with ordinary least squares >>>> fit >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >>>> >>> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >>>> >>> pinv, compared to scipy 0.17 pinv: >>>> >>> >>>> >>> # SSE using high-precision sympy pinv >>>> >>> 72.0232781366615313 >>>> >>> # SSE using scipy 0.16 pinv >>>> >>> 72.0232781366625261 >>>> >>> # SSE using scipy 0.17 pinv >>>> >>> 72.0232781390480170 >>>> >> >>>> >> >>>> >> That difference looks pretty big, looks almost like a cutoff effect. >>>> What >>>> >> is the matrix X, in particular, what is its condition number or >>>> singular >>>> >> values? Might also be interesting to see residuals for the different >>>> >> functions. >>>> >> >>>> >> >>>> >> >>>> > >>>> > Note that this might be an indication that the result of the >>>> calculation has >>>> > an intrinsically large error bar. 
>>>> >>>> In [6]: np.linalg.svd(X, compute_uv=False) >>>> Out[6]: >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >>>> 1.72675581e-03, 9.51576795e-04]) >>>> >>>> In [7]: np.linalg.cond(X) >>>> Out[7]: 73278544935.300507 >>>> >>> >>> Condition number is about 7e10, so you might expect to lose about 10 >>> digits of of accuracy just due to roundoff. The covariance of the fitting >>> error will be `inv(X.T, X) * error`, so you can look at the diagonals for >>> the variance of the fitted parameters just for kicks, taking `error` to be >>> roundoff error or measurement error. This is not to say that the results >>> should not be more reproducible, but should be of interest for the validity >>> of the solution. Note that I suspect that there may be a difference in the >>> default `rcond` used, or maybe the matrix is on the edge and a small change >>> in results changes the effective rank. You can run with the >>> `return_rank=True` argument to check that. Looks like machine precision is >>> used for the default value of rcond, so experimenting with that probable >>> won't be informative, but you might want to try it anyway. >>> >> >> >> statsmodels is using numpy pinv for the linear model and it works in >> almost all cases very well *). >> >> Using pinv with singular (given the rcond threshold) X makes everything >> well defined and not subject to numerical noise because of the >> regularization induced by pinv. Our covariance for the parameter >> estimates, based on pinv(X), is also not noisy but it's of reduced rank. >> > > What I'm thinking is that changing the algorithm, as here, is like > introducing noise relative to the original, so even if the result is > perfectly valid, the fitted parameters might change significantly. A change > in the 11-th digit wouldn't surprise me at all, depending on the actual X > and rhs. > > >> >> The main problem I have seen so far is that the default rcond is too >> small for some cases, and it breaks down with near singular X with rcond a >> bit larger than 1e-15 in an example with dummy variables. >> >> https://github.com/statsmodels/statsmodels/issues/2628 >> >> http://stackoverflow.com/questions/32599159/robustness-issue-of-statsmodel-linear-regression-ols-python >> >> >> If I think a bit longer, then I remember two other cases where pinv with >> ill conditioned X also doesn't work very well with continuous almost >> collinear data, one NIST example with badly scaled polynomials where >> precision goes down to a few decimals and a very small economic dataset >> where the normal equations (or orthogonality conditions?) are not satisfied >> at a good precision.. >> >> > Yeah, it can make a real mess with polynomial fits ;) > The classic rant on these matters is in Numerical Methods that Work. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Thu Mar 17 03:19:37 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 17 Mar 2016 00:19:37 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: Hi, On Wed, Mar 16, 2016 at 8:39 PM, wrote: > > > On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris > wrote: >> >> >> >> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >> wrote: >>> >>> Hi, >>> >>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >>> wrote: >>> > >>> > >>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >>> > wrote: >>> >> >>> >> >>> >> >>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >>> >> >>> >> wrote: >>> >>> >>> >>> Hi, >>> >>> >>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >>> >>> >>> >>> wrote: >>> >>> > >>> >>> > >>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>> >>> > >>> >>> > wrote: >>> >>> >> >>> >>> >> Hi, >>> >>> >> >>> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy >>> >>> >> 0.17.0: >>> >>> >> >>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>> >>> >> >>> >>> >> This turns out to be due to small changes in the result of >>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same >>> >>> >> platform >>> >>> >> - the travis Ubuntu in this case. >>> >>> >> >>> >>> >> Is that expected? I haven't investigated deeply, but I believe >>> >>> >> the >>> >>> >> 0.16.1 result (back into the depths of time) is slightly more >>> >>> >> accurate. Is that possible? >>> >>> > >>> >>> > >>> >>> > That's very well possible, because we changed the underlying LAPACK >>> >>> > routine >>> >>> > that's used: >>> >>> > >>> >>> > >>> >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. >>> >>> > This PR also had some lstsq precision issues, so could be useful: >>> >>> > https://github.com/scipy/scipy/pull/5550 >>> >>> > >>> >>> > What is the accuracy difference in your case? >>> >>> >>> >>> Damn - I thought you might ask me that! >>> >>> >>> >>> The test where it showed up comes from some calculations on a simple >>> >>> regression problem. >>> >>> >>> >>> The regression model is $y = X B + e$, with ordinary least squares >>> >>> fit >>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >>> >>> >>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >>> >>> pinv, compared to scipy 0.17 pinv: >>> >>> >>> >>> # SSE using high-precision sympy pinv >>> >>> 72.0232781366615313 >>> >>> # SSE using scipy 0.16 pinv >>> >>> 72.0232781366625261 >>> >>> # SSE using scipy 0.17 pinv >>> >>> 72.0232781390480170 >>> >> >>> >> >>> >> That difference looks pretty big, looks almost like a cutoff effect. >>> >> What >>> >> is the matrix X, in particular, what is its condition number or >>> >> singular >>> >> values? Might also be interesting to see residuals for the different >>> >> functions. >>> >> >>> >> >>> >> >>> > >>> > Note that this might be an indication that the result of the >>> > calculation has >>> > an intrinsically large error bar. >>> >>> In [6]: np.linalg.svd(X, compute_uv=False) >>> Out[6]: >>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >>> 1.72675581e-03, 9.51576795e-04]) >>> >>> In [7]: np.linalg.cond(X) >>> Out[7]: 73278544935.300507 >> >> >> Condition number is about 7e10, so you might expect to lose about 10 >> digits of of accuracy just due to roundoff. 
The covariance of the fitting >> error will be `inv(X.T, X) * error`, so you can look at the diagonals for >> the variance of the fitted parameters just for kicks, taking `error` to be >> roundoff error or measurement error. This is not to say that the results >> should not be more reproducible, but should be of interest for the validity >> of the solution. Note that I suspect that there may be a difference in the >> default `rcond` used, or maybe the matrix is on the edge and a small change >> in results changes the effective rank. You can run with the >> `return_rank=True` argument to check that. Looks like machine precision is >> used for the default value of rcond, so experimenting with that probable >> won't be informative, but you might want to try it anyway. > > > > statsmodels is using numpy pinv for the linear model and it works in almost > all cases very well *). Yes, the error for numpy.linalg pinv is a lot less: Sympy high-precision pinv (as before): 72.0232781366602524 Scipy 0.16 pinv (as before, difference at 12th dp): 72.0232781366620856 Scipy 0.17 pinv (as before, difference at 9th dp): 72.0232781390479602 Numpy pinv (difference at 12th dp) In [2]: print_scalar(sse(X, np.linalg.pinv(X), y)) 72.0232781366618440 So numpy pinv is more accurate than scipy pinv for this matrix, for scipy 0.17, and both scipy 0.16 and numpy differ at the 12th dp, as does estimation for a singular matrix (previous email). Incidentally, this is all on scipy / openblas / linux. On OSX: Scipy 0.17.0: 72.0232781366634640 Numpy: 72.0232781366609771 Scipy 0.16.0 72.0232781366634640 So - on OSX - differences also predictably at 12th dp for both scipy versions and numpy. It does seem that scipy 0.17 on Linux is the outlier. Matthew From olivier.grisel at ensta.org Thu Mar 17 05:30:24 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Thu, 17 Mar 2016 10:30:24 +0100 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: Can you check with Atlas on linux then? -- Olivier From matthew.brett at gmail.com Thu Mar 17 15:30:28 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 17 Mar 2016 12:30:28 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Thu, Mar 17, 2016 at 2:30 AM, Olivier Grisel wrote: > Can you check with Atlas on linux then? Yes, same idea, this on another computer, Debian packaged ATLAS: Sympy high-precision pinv SSE 72.0232781366615313 Local numpy 1.10.4 linalg pinv 72.0232781366614176 Local scipy 0.16.0 linalg pinv 72.0232781366600392 Local scipy 0.17.0 linalg pinv 72.0232781368149233 Difference at 10th dp, but still two orders of magnitude different from the other variations. https://github.com/matthew-brett/scipy-pinv/blob/master/pinv_futz.py Matthew From matthew.brett at gmail.com Fri Mar 18 15:07:39 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 18 Mar 2016 12:07:39 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris wrote: > > > On Wed, Mar 16, 2016 at 9:39 PM, wrote: >> >> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris >> wrote: >>> >>> >>> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >>> wrote: >>>> >>>> Hi, >>>> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >>>> wrote: >>>> > >>>> > >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >>>> > wrote: >>>> >> >>>> >> >>>> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >>>> >> >>>> >> wrote: >>>> >>> >>>> >>> Hi, >>>> >>> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >>>> >>> >>>> >>> wrote: >>>> >>> > >>>> >>> > >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>>> >>> > >>>> >>> > wrote: >>>> >>> >> >>>> >>> >> Hi, >>>> >>> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy >>>> >>> >> 0.17.0: >>>> >>> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>>> >>> >> >>>> >>> >> This turns out to be due to small changes in the result of >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same >>>> >>> >> platform >>>> >>> >> - the travis Ubuntu in this case. >>>> >>> >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I believe >>>> >>> >> the >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly more >>>> >>> >> accurate. Is that possible? >>>> >>> > >>>> >>> > >>>> >>> > That's very well possible, because we changed the underlying >>>> >>> > LAPACK >>>> >>> > routine >>>> >>> > that's used: >>>> >>> > >>>> >>> > >>>> >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. >>>> >>> > This PR also had some lstsq precision issues, so could be useful: >>>> >>> > https://github.com/scipy/scipy/pull/5550 >>>> >>> > >>>> >>> > What is the accuracy difference in your case? >>>> >>> >>>> >>> Damn - I thought you might ask me that! >>>> >>> >>>> >>> The test where it showed up comes from some calculations on a simple >>>> >>> regression problem. >>>> >>> >>>> >>> The regression model is $y = X B + e$, with ordinary least squares >>>> >>> fit >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >>>> >>> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >>>> >>> pinv, compared to scipy 0.17 pinv: >>>> >>> >>>> >>> # SSE using high-precision sympy pinv >>>> >>> 72.0232781366615313 >>>> >>> # SSE using scipy 0.16 pinv >>>> >>> 72.0232781366625261 >>>> >>> # SSE using scipy 0.17 pinv >>>> >>> 72.0232781390480170 >>>> >> >>>> >> >>>> >> That difference looks pretty big, looks almost like a cutoff effect. >>>> >> What >>>> >> is the matrix X, in particular, what is its condition number or >>>> >> singular >>>> >> values? Might also be interesting to see residuals for the different >>>> >> functions. >>>> >> >>>> >> >>>> >> >>>> > >>>> > Note that this might be an indication that the result of the >>>> > calculation has >>>> > an intrinsically large error bar. 
>>>> >>>> In [6]: np.linalg.svd(X, compute_uv=False) >>>> Out[6]: >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >>>> 1.72675581e-03, 9.51576795e-04]) >>>> >>>> In [7]: np.linalg.cond(X) >>>> Out[7]: 73278544935.300507 >>> >>> >>> Condition number is about 7e10, so you might expect to lose about 10 >>> digits of of accuracy just due to roundoff. The covariance of the fitting >>> error will be `inv(X.T, X) * error`, so you can look at the diagonals for >>> the variance of the fitted parameters just for kicks, taking `error` to be >>> roundoff error or measurement error. This is not to say that the results >>> should not be more reproducible, but should be of interest for the validity >>> of the solution. Note that I suspect that there may be a difference in the >>> default `rcond` used, or maybe the matrix is on the edge and a small change >>> in results changes the effective rank. You can run with the >>> `return_rank=True` argument to check that. Looks like machine precision is >>> used for the default value of rcond, so experimenting with that probable >>> won't be informative, but you might want to try it anyway. >> >> >> >> statsmodels is using numpy pinv for the linear model and it works in >> almost all cases very well *). >> >> Using pinv with singular (given the rcond threshold) X makes everything >> well defined and not subject to numerical noise because of the >> regularization induced by pinv. Our covariance for the parameter estimates, >> based on pinv(X), is also not noisy but it's of reduced rank. > > > What I'm thinking is that changing the algorithm, as here, is like > introducing noise relative to the original, so even if the result is > perfectly valid, the fitted parameters might change significantly. A change > in the 11-th digit wouldn't surprise me at all, depending on the actual X > and rhs. I believe the errors $e = y - X^+ y$ are in general robust to differences between valid pinv solutions, even in extreme cases: In [2]: print_scalar(sse(X, npl.pinv(X), y)) 72.0232781366616024 In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) 72.0232781366939463 In [4]: npl.cond(np.c_[X, X]) Out[4]: 6.6247965335098062e+26 Cheers, Matthew From josef.pktd at gmail.com Fri Mar 18 15:19:38 2016 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 18 Mar 2016 15:19:38 -0400 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett wrote: > On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris > wrote: > > > > > > On Wed, Mar 16, 2016 at 9:39 PM, wrote: > >> > >> > >> > >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris > >> wrote: > >>> > >>> > >>> > >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett < > matthew.brett at gmail.com> > >>> wrote: > >>>> > >>>> Hi, > >>>> > >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris > >>>> wrote: > >>>> > > >>>> > > >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris > >>>> > wrote: > >>>> >> > >>>> >> > >>>> >> > >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett > >>>> >> > >>>> >> wrote: > >>>> >>> > >>>> >>> Hi, > >>>> >>> > >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers > >>>> >>> > >>>> >>> wrote: > >>>> >>> > > >>>> >>> > > >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > >>>> >>> > > >>>> >>> > wrote: > >>>> >>> >> > >>>> >>> >> Hi, > >>>> >>> >> > >>>> >>> >> I just noticed that our (neuroimaging) test suite fails for > scipy > >>>> >>> >> 0.17.0: > >>>> >>> >> > >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 > >>>> >>> >> > >>>> >>> >> This turns out to be due to small changes in the result of > >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same > >>>> >>> >> platform > >>>> >>> >> - the travis Ubuntu in this case. > >>>> >>> >> > >>>> >>> >> Is that expected? I haven't investigated deeply, but I > believe > >>>> >>> >> the > >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly more > >>>> >>> >> accurate. Is that possible? > >>>> >>> > > >>>> >>> > > >>>> >>> > That's very well possible, because we changed the underlying > >>>> >>> > LAPACK > >>>> >>> > routine > >>>> >>> > that's used: > >>>> >>> > > >>>> >>> > > >>>> >>> > > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements > . > >>>> >>> > This PR also had some lstsq precision issues, so could be > useful: > >>>> >>> > https://github.com/scipy/scipy/pull/5550 > >>>> >>> > > >>>> >>> > What is the accuracy difference in your case? > >>>> >>> > >>>> >>> Damn - I thought you might ask me that! > >>>> >>> > >>>> >>> The test where it showed up comes from some calculations on a > simple > >>>> >>> regression problem. > >>>> >>> > >>>> >>> The regression model is $y = X B + e$, with ordinary least squares > >>>> >>> fit > >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. > >>>> >>> > >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy > 0.16 > >>>> >>> pinv, compared to scipy 0.17 pinv: > >>>> >>> > >>>> >>> # SSE using high-precision sympy pinv > >>>> >>> 72.0232781366615313 > >>>> >>> # SSE using scipy 0.16 pinv > >>>> >>> 72.0232781366625261 > >>>> >>> # SSE using scipy 0.17 pinv > >>>> >>> 72.0232781390480170 > >>>> >> > >>>> >> > >>>> >> That difference looks pretty big, looks almost like a cutoff > effect. > >>>> >> What > >>>> >> is the matrix X, in particular, what is its condition number or > >>>> >> singular > >>>> >> values? Might also be interesting to see residuals for the > different > >>>> >> functions. > >>>> >> > >>>> >> > >>>> >> > >>>> > > >>>> > Note that this might be an indication that the result of the > >>>> > calculation has > >>>> > an intrinsically large error bar. 
> >>>> > >>>> In [6]: np.linalg.svd(X, compute_uv=False) > >>>> Out[6]: > >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, > >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, > >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, > >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, > >>>> 1.72675581e-03, 9.51576795e-04]) > >>>> > >>>> In [7]: np.linalg.cond(X) > >>>> Out[7]: 73278544935.300507 > >>> > >>> > >>> Condition number is about 7e10, so you might expect to lose about 10 > >>> digits of of accuracy just due to roundoff. The covariance of the > fitting > >>> error will be `inv(X.T, X) * error`, so you can look at the diagonals > for > >>> the variance of the fitted parameters just for kicks, taking `error` > to be > >>> roundoff error or measurement error. This is not to say that the > results > >>> should not be more reproducible, but should be of interest for the > validity > >>> of the solution. Note that I suspect that there may be a difference > in the > >>> default `rcond` used, or maybe the matrix is on the edge and a small > change > >>> in results changes the effective rank. You can run with the > >>> `return_rank=True` argument to check that. Looks like machine > precision is > >>> used for the default value of rcond, so experimenting with that > probable > >>> won't be informative, but you might want to try it anyway. > >> > >> > >> > >> statsmodels is using numpy pinv for the linear model and it works in > >> almost all cases very well *). > >> > >> Using pinv with singular (given the rcond threshold) X makes everything > >> well defined and not subject to numerical noise because of the > >> regularization induced by pinv. Our covariance for the parameter > estimates, > >> based on pinv(X), is also not noisy but it's of reduced rank. > > > > > > What I'm thinking is that changing the algorithm, as here, is like > > introducing noise relative to the original, so even if the result is > > perfectly valid, the fitted parameters might change significantly. A > change > > in the 11-th digit wouldn't surprise me at all, depending on the actual X > > and rhs. > > I believe the errors $e = y - X^+ y$ are in general robust to > differences between valid pinv solutions, even in extreme cases: > > In [2]: print_scalar(sse(X, npl.pinv(X), y)) > 72.0232781366616024 > > In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) > 72.0232781366939463 > > In [4]: npl.cond(np.c_[X, X]) > Out[4]: 6.6247965335098062e+26 > I think this does not depend on the smallest eigenvalues which should be regularized away in cases like this. The potential problem is with eigenvalues that are just a but too large for rcond to remove them but still have numerical problems. I think it's a similar issue about the threshold as the one that you raised some time ago with determining the matrix rank for near singular matrices. Josef > > Cheers, > > Matthew > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Fri Mar 18 15:56:04 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 18 Mar 2016 12:56:04 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Fri, Mar 18, 2016 at 12:19 PM, wrote: > > > On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett > wrote: >> >> On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris >> wrote: >> > >> > >> > On Wed, Mar 16, 2016 at 9:39 PM, wrote: >> >> >> >> >> >> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris >> >> wrote: >> >>> >> >>> >> >>> >> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >> >>> >> >>> wrote: >> >>>> >> >>>> Hi, >> >>>> >> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >> >>>> wrote: >> >>>> > >> >>>> > >> >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >> >>>> > wrote: >> >>>> >> >> >>>> >> >> >>>> >> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >> >>>> >> >> >>>> >> wrote: >> >>>> >>> >> >>>> >>> Hi, >> >>>> >>> >> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >> >>>> >>> >> >>>> >>> wrote: >> >>>> >>> > >> >>>> >>> > >> >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >> >>>> >>> > >> >>>> >>> > wrote: >> >>>> >>> >> >> >>>> >>> >> Hi, >> >>>> >>> >> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails for >> >>>> >>> >> scipy >> >>>> >>> >> 0.17.0: >> >>>> >>> >> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >> >>>> >>> >> >> >>>> >>> >> This turns out to be due to small changes in the result of >> >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same >> >>>> >>> >> platform >> >>>> >>> >> - the travis Ubuntu in this case. >> >>>> >>> >> >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I >> >>>> >>> >> believe >> >>>> >>> >> the >> >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly more >> >>>> >>> >> accurate. Is that possible? >> >>>> >>> > >> >>>> >>> > >> >>>> >>> > That's very well possible, because we changed the underlying >> >>>> >>> > LAPACK >> >>>> >>> > routine >> >>>> >>> > that's used: >> >>>> >>> > >> >>>> >>> > >> >>>> >>> > >> >>>> >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. >> >>>> >>> > This PR also had some lstsq precision issues, so could be >> >>>> >>> > useful: >> >>>> >>> > https://github.com/scipy/scipy/pull/5550 >> >>>> >>> > >> >>>> >>> > What is the accuracy difference in your case? >> >>>> >>> >> >>>> >>> Damn - I thought you might ask me that! >> >>>> >>> >> >>>> >>> The test where it showed up comes from some calculations on a >> >>>> >>> simple >> >>>> >>> regression problem. >> >>>> >>> >> >>>> >>> The regression model is $y = X B + e$, with ordinary least >> >>>> >>> squares >> >>>> >>> fit >> >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix >> >>>> >>> X. >> >>>> >>> >> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy >> >>>> >>> 0.16 >> >>>> >>> pinv, compared to scipy 0.17 pinv: >> >>>> >>> >> >>>> >>> # SSE using high-precision sympy pinv >> >>>> >>> 72.0232781366615313 >> >>>> >>> # SSE using scipy 0.16 pinv >> >>>> >>> 72.0232781366625261 >> >>>> >>> # SSE using scipy 0.17 pinv >> >>>> >>> 72.0232781390480170 >> >>>> >> >> >>>> >> >> >>>> >> That difference looks pretty big, looks almost like a cutoff >> >>>> >> effect. >> >>>> >> What >> >>>> >> is the matrix X, in particular, what is its condition number or >> >>>> >> singular >> >>>> >> values? Might also be interesting to see residuals for the >> >>>> >> different >> >>>> >> functions. 
>> >>>> >> >> >>>> >> >> >>>> >> >> >>>> > >> >>>> > Note that this might be an indication that the result of the >> >>>> > calculation has >> >>>> > an intrinsically large error bar. >> >>>> >> >>>> In [6]: np.linalg.svd(X, compute_uv=False) >> >>>> Out[6]: >> >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >> >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >> >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >> >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >> >>>> 1.72675581e-03, 9.51576795e-04]) >> >>>> >> >>>> In [7]: np.linalg.cond(X) >> >>>> Out[7]: 73278544935.300507 >> >>> >> >>> >> >>> Condition number is about 7e10, so you might expect to lose about 10 >> >>> digits of of accuracy just due to roundoff. The covariance of the >> >>> fitting >> >>> error will be `inv(X.T, X) * error`, so you can look at the diagonals >> >>> for >> >>> the variance of the fitted parameters just for kicks, taking `error` >> >>> to be >> >>> roundoff error or measurement error. This is not to say that the >> >>> results >> >>> should not be more reproducible, but should be of interest for the >> >>> validity >> >>> of the solution. Note that I suspect that there may be a difference >> >>> in the >> >>> default `rcond` used, or maybe the matrix is on the edge and a small >> >>> change >> >>> in results changes the effective rank. You can run with the >> >>> `return_rank=True` argument to check that. Looks like machine >> >>> precision is >> >>> used for the default value of rcond, so experimenting with that >> >>> probable >> >>> won't be informative, but you might want to try it anyway. >> >> >> >> >> >> >> >> statsmodels is using numpy pinv for the linear model and it works in >> >> almost all cases very well *). >> >> >> >> Using pinv with singular (given the rcond threshold) X makes everything >> >> well defined and not subject to numerical noise because of the >> >> regularization induced by pinv. Our covariance for the parameter >> >> estimates, >> >> based on pinv(X), is also not noisy but it's of reduced rank. >> > >> > >> > What I'm thinking is that changing the algorithm, as here, is like >> > introducing noise relative to the original, so even if the result is >> > perfectly valid, the fitted parameters might change significantly. A >> > change >> > in the 11-th digit wouldn't surprise me at all, depending on the actual >> > X >> > and rhs. >> >> I believe the errors $e = y - X^+ y$ are in general robust to >> differences between valid pinv solutions, even in extreme cases: >> >> In [2]: print_scalar(sse(X, npl.pinv(X), y)) >> 72.0232781366616024 >> >> In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) >> 72.0232781366939463 >> >> In [4]: npl.cond(np.c_[X, X]) >> Out[4]: 6.6247965335098062e+26 > > > > I think this does not depend on the smallest eigenvalues which should be > regularized away in cases like this. > The potential problem is with eigenvalues that are just a but too large for > rcond to remove them but still have numerical problems. > > I think it's a similar issue about the threshold as the one that you raised > some time ago with determining the matrix rank for near singular matrices. My point was (I think you're agreeing) that the accuracy of the pinv does not depend in a simple way on the condition number of the input matrix. 
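To make that concrete with something self-contained - this is only a
sketch, the helper names and the synthetic X, y below are just for
illustration (the real X comes from the nipy test suite):

import numpy as np
import numpy.linalg as npl

def sse(design, piv, data):
    # residual sum of squares for the fit data ~ dot(design, B), where
    # B is computed from a given pseudoinverse of the design matrix
    fitted = design.dot(piv.dot(data))
    return np.sum((data - fitted) ** 2)

def print_scalar(val):
    print('{0:.16f}'.format(val))

rng = np.random.RandomState(0)
X = rng.normal(size=(20, 5))
X[:, -1] = X[:, 0] + 1e-8 * rng.normal(size=20)  # nearly dependent column
y = rng.normal(size=20)

print_scalar(sse(X, npl.pinv(X), y))
print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y))

Even though np.c_[X, X] is exactly rank deficient, the two SSE values
should agree to many digits, because the residuals only depend on the
projection onto the column space, not on which valid pseudoinverse was
used to compute the fit.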
I'm not quite sure how to assess the change in accuracy of scipy 0.17 on Linux, but it does seem to me to be significantly different (and worse than) the other options (like numpy, scipy 0.16) Matthew From josef.pktd at gmail.com Fri Mar 18 16:09:30 2016 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 18 Mar 2016 16:09:30 -0400 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Fri, Mar 18, 2016 at 3:56 PM, Matthew Brett wrote: > On Fri, Mar 18, 2016 at 12:19 PM, wrote: > > > > > > On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett > > wrote: > >> > >> On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris > >> wrote: > >> > > >> > > >> > On Wed, Mar 16, 2016 at 9:39 PM, wrote: > >> >> > >> >> > >> >> > >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris > >> >> wrote: > >> >>> > >> >>> > >> >>> > >> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett > >> >>> > >> >>> wrote: > >> >>>> > >> >>>> Hi, > >> >>>> > >> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris > >> >>>> wrote: > >> >>>> > > >> >>>> > > >> >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris > >> >>>> > wrote: > >> >>>> >> > >> >>>> >> > >> >>>> >> > >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett > >> >>>> >> > >> >>>> >> wrote: > >> >>>> >>> > >> >>>> >>> Hi, > >> >>>> >>> > >> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers > >> >>>> >>> > >> >>>> >>> wrote: > >> >>>> >>> > > >> >>>> >>> > > >> >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > >> >>>> >>> > > >> >>>> >>> > wrote: > >> >>>> >>> >> > >> >>>> >>> >> Hi, > >> >>>> >>> >> > >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails for > >> >>>> >>> >> scipy > >> >>>> >>> >> 0.17.0: > >> >>>> >>> >> > >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 > >> >>>> >>> >> > >> >>>> >>> >> This turns out to be due to small changes in the result of > >> >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same > >> >>>> >>> >> platform > >> >>>> >>> >> - the travis Ubuntu in this case. > >> >>>> >>> >> > >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I > >> >>>> >>> >> believe > >> >>>> >>> >> the > >> >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly > more > >> >>>> >>> >> accurate. Is that possible? > >> >>>> >>> > > >> >>>> >>> > > >> >>>> >>> > That's very well possible, because we changed the underlying > >> >>>> >>> > LAPACK > >> >>>> >>> > routine > >> >>>> >>> > that's used: > >> >>>> >>> > > >> >>>> >>> > > >> >>>> >>> > > >> >>>> >>> > > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements > . > >> >>>> >>> > This PR also had some lstsq precision issues, so could be > >> >>>> >>> > useful: > >> >>>> >>> > https://github.com/scipy/scipy/pull/5550 > >> >>>> >>> > > >> >>>> >>> > What is the accuracy difference in your case? > >> >>>> >>> > >> >>>> >>> Damn - I thought you might ask me that! > >> >>>> >>> > >> >>>> >>> The test where it showed up comes from some calculations on a > >> >>>> >>> simple > >> >>>> >>> regression problem. > >> >>>> >>> > >> >>>> >>> The regression model is $y = X B + e$, with ordinary least > >> >>>> >>> squares > >> >>>> >>> fit > >> >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix > >> >>>> >>> X. 
> >> >>>> >>> > >> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy > >> >>>> >>> 0.16 > >> >>>> >>> pinv, compared to scipy 0.17 pinv: > >> >>>> >>> > >> >>>> >>> # SSE using high-precision sympy pinv > >> >>>> >>> 72.0232781366615313 > >> >>>> >>> # SSE using scipy 0.16 pinv > >> >>>> >>> 72.0232781366625261 > >> >>>> >>> # SSE using scipy 0.17 pinv > >> >>>> >>> 72.0232781390480170 > >> >>>> >> > >> >>>> >> > >> >>>> >> That difference looks pretty big, looks almost like a cutoff > >> >>>> >> effect. > >> >>>> >> What > >> >>>> >> is the matrix X, in particular, what is its condition number or > >> >>>> >> singular > >> >>>> >> values? Might also be interesting to see residuals for the > >> >>>> >> different > >> >>>> >> functions. > >> >>>> >> > >> >>>> >> > >> >>>> >> > >> >>>> > > >> >>>> > Note that this might be an indication that the result of the > >> >>>> > calculation has > >> >>>> > an intrinsically large error bar. > >> >>>> > >> >>>> In [6]: np.linalg.svd(X, compute_uv=False) > >> >>>> Out[6]: > >> >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, > >> >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, > >> >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, > >> >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, > >> >>>> 1.72675581e-03, 9.51576795e-04]) > >> >>>> > >> >>>> In [7]: np.linalg.cond(X) > >> >>>> Out[7]: 73278544935.300507 > >> >>> > >> >>> > >> >>> Condition number is about 7e10, so you might expect to lose about 10 > >> >>> digits of of accuracy just due to roundoff. The covariance of the > >> >>> fitting > >> >>> error will be `inv(X.T, X) * error`, so you can look at the > diagonals > >> >>> for > >> >>> the variance of the fitted parameters just for kicks, taking `error` > >> >>> to be > >> >>> roundoff error or measurement error. This is not to say that the > >> >>> results > >> >>> should not be more reproducible, but should be of interest for the > >> >>> validity > >> >>> of the solution. Note that I suspect that there may be a difference > >> >>> in the > >> >>> default `rcond` used, or maybe the matrix is on the edge and a small > >> >>> change > >> >>> in results changes the effective rank. You can run with the > >> >>> `return_rank=True` argument to check that. Looks like machine > >> >>> precision is > >> >>> used for the default value of rcond, so experimenting with that > >> >>> probable > >> >>> won't be informative, but you might want to try it anyway. > >> >> > >> >> > >> >> > >> >> statsmodels is using numpy pinv for the linear model and it works in > >> >> almost all cases very well *). > >> >> > >> >> Using pinv with singular (given the rcond threshold) X makes > everything > >> >> well defined and not subject to numerical noise because of the > >> >> regularization induced by pinv. Our covariance for the parameter > >> >> estimates, > >> >> based on pinv(X), is also not noisy but it's of reduced rank. > >> > > >> > > >> > What I'm thinking is that changing the algorithm, as here, is like > >> > introducing noise relative to the original, so even if the result is > >> > perfectly valid, the fitted parameters might change significantly. A > >> > change > >> > in the 11-th digit wouldn't surprise me at all, depending on the > actual > >> > X > >> > and rhs. 
> >> > >> I believe the errors $e = y - X^+ y$ are in general robust to > >> differences between valid pinv solutions, even in extreme cases: > >> > >> In [2]: print_scalar(sse(X, npl.pinv(X), y)) > >> 72.0232781366616024 > >> > >> In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) > >> 72.0232781366939463 > >> > >> In [4]: npl.cond(np.c_[X, X]) > >> Out[4]: 6.6247965335098062e+26 > > > > > > > > I think this does not depend on the smallest eigenvalues which should be > > regularized away in cases like this. > > The potential problem is with eigenvalues that are just a but too large > for > > rcond to remove them but still have numerical problems. > > > > I think it's a similar issue about the threshold as the one that you > raised > > some time ago with determining the matrix rank for near singular > matrices. > > My point was (I think you're agreeing) that the accuracy of the pinv > does not depend in a simple way on the condition number of the input > matrix. > That's not the point I wanted to make. The condition number doesn't matter if there is an singular/eigenvalue with rcond 1e-16 or 1e-30 or 1e-100. However, the numerical accuracy might depend on the condition number when there is a singular value with rcond of 2e-15 or 5e-15. I didn't manage to find the default cond or rcond for scipy.linalg.pinv and lstsq. The problem I had was with numpy's pinv. I think there is a grey zone in rcond of eigenvalues that might be bad numerically. (My only experience is with SVD based pinv.) Josef > > I'm not quite sure how to assess the change in accuracy of scipy 0.17 > on Linux, but it does seem to me to be significantly different (and > worse than) the other options (like numpy, scipy 0.16) > > Matthew > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Fri Mar 18 16:19:58 2016 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 18 Mar 2016 16:19:58 -0400 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Fri, Mar 18, 2016 at 4:09 PM, wrote: > > > On Fri, Mar 18, 2016 at 3:56 PM, Matthew Brett > wrote: > >> On Fri, Mar 18, 2016 at 12:19 PM, wrote: >> > >> > >> > On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett > > >> > wrote: >> >> >> >> On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris >> >> wrote: >> >> > >> >> > >> >> > On Wed, Mar 16, 2016 at 9:39 PM, wrote: >> >> >> >> >> >> >> >> >> >> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris >> >> >> wrote: >> >> >>> >> >> >>> >> >> >>> >> >> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >> >> >>> >> >> >>> wrote: >> >> >>>> >> >> >>>> Hi, >> >> >>>> >> >> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >> >> >>>> wrote: >> >> >>>> > >> >> >>>> > >> >> >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >> >> >>>> > wrote: >> >> >>>> >> >> >> >>>> >> >> >> >>>> >> >> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >> >> >>>> >> >> >> >>>> >> wrote: >> >> >>>> >>> >> >> >>>> >>> Hi, >> >> >>>> >>> >> >> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >> >> >>>> >>> >> >> >>>> >>> wrote: >> >> >>>> >>> > >> >> >>>> >>> > >> >> >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >> >> >>>> >>> > >> >> >>>> >>> > wrote: >> >> >>>> >>> >> >> >> >>>> >>> >> Hi, >> >> >>>> >>> >> >> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails for >> >> >>>> >>> >> scipy >> >> >>>> >>> >> 0.17.0: >> >> >>>> >>> >> >> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >> >> >>>> >>> >> >> >> >>>> >>> >> This turns out to be due to small changes in the result of >> >> >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the >> same >> >> >>>> >>> >> platform >> >> >>>> >>> >> - the travis Ubuntu in this case. >> >> >>>> >>> >> >> >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I >> >> >>>> >>> >> believe >> >> >>>> >>> >> the >> >> >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly >> more >> >> >>>> >>> >> accurate. Is that possible? >> >> >>>> >>> > >> >> >>>> >>> > >> >> >>>> >>> > That's very well possible, because we changed the underlying >> >> >>>> >>> > LAPACK >> >> >>>> >>> > routine >> >> >>>> >>> > that's used: >> >> >>>> >>> > >> >> >>>> >>> > >> >> >>>> >>> > >> >> >>>> >>> > >> https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements >> . >> >> >>>> >>> > This PR also had some lstsq precision issues, so could be >> >> >>>> >>> > useful: >> >> >>>> >>> > https://github.com/scipy/scipy/pull/5550 >> >> >>>> >>> > >> >> >>>> >>> > What is the accuracy difference in your case? >> >> >>>> >>> >> >> >>>> >>> Damn - I thought you might ask me that! >> >> >>>> >>> >> >> >>>> >>> The test where it showed up comes from some calculations on a >> >> >>>> >>> simple >> >> >>>> >>> regression problem. >> >> >>>> >>> >> >> >>>> >>> The regression model is $y = X B + e$, with ordinary least >> >> >>>> >>> squares >> >> >>>> >>> fit >> >> >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of >> matrix >> >> >>>> >>> X. 
>> >> >>>> >>> >> >> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy >> >> >>>> >>> 0.16 >> >> >>>> >>> pinv, compared to scipy 0.17 pinv: >> >> >>>> >>> >> >> >>>> >>> # SSE using high-precision sympy pinv >> >> >>>> >>> 72.0232781366615313 >> >> >>>> >>> # SSE using scipy 0.16 pinv >> >> >>>> >>> 72.0232781366625261 >> >> >>>> >>> # SSE using scipy 0.17 pinv >> >> >>>> >>> 72.0232781390480170 >> >> >>>> >> >> >> >>>> >> >> >> >>>> >> That difference looks pretty big, looks almost like a cutoff >> >> >>>> >> effect. >> >> >>>> >> What >> >> >>>> >> is the matrix X, in particular, what is its condition number or >> >> >>>> >> singular >> >> >>>> >> values? Might also be interesting to see residuals for the >> >> >>>> >> different >> >> >>>> >> functions. >> >> >>>> >> >> >> >>>> >> >> >> >>>> >> >> >> >>>> > >> >> >>>> > Note that this might be an indication that the result of the >> >> >>>> > calculation has >> >> >>>> > an intrinsically large error bar. >> >> >>>> >> >> >>>> In [6]: np.linalg.svd(X, compute_uv=False) >> >> >>>> Out[6]: >> >> >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >> >> >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >> >> >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >> >> >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >> >> >>>> 1.72675581e-03, 9.51576795e-04]) >> >> >>>> >> >> >>>> In [7]: np.linalg.cond(X) >> >> >>>> Out[7]: 73278544935.300507 >> >> >>> >> >> >>> >> >> >>> Condition number is about 7e10, so you might expect to lose about >> 10 >> >> >>> digits of of accuracy just due to roundoff. The covariance of the >> >> >>> fitting >> >> >>> error will be `inv(X.T, X) * error`, so you can look at the >> diagonals >> >> >>> for >> >> >>> the variance of the fitted parameters just for kicks, taking >> `error` >> >> >>> to be >> >> >>> roundoff error or measurement error. This is not to say that the >> >> >>> results >> >> >>> should not be more reproducible, but should be of interest for the >> >> >>> validity >> >> >>> of the solution. Note that I suspect that there may be a >> difference >> >> >>> in the >> >> >>> default `rcond` used, or maybe the matrix is on the edge and a >> small >> >> >>> change >> >> >>> in results changes the effective rank. You can run with the >> >> >>> `return_rank=True` argument to check that. Looks like machine >> >> >>> precision is >> >> >>> used for the default value of rcond, so experimenting with that >> >> >>> probable >> >> >>> won't be informative, but you might want to try it anyway. >> >> >> >> >> >> >> >> >> >> >> >> statsmodels is using numpy pinv for the linear model and it works in >> >> >> almost all cases very well *). >> >> >> >> >> >> Using pinv with singular (given the rcond threshold) X makes >> everything >> >> >> well defined and not subject to numerical noise because of the >> >> >> regularization induced by pinv. Our covariance for the parameter >> >> >> estimates, >> >> >> based on pinv(X), is also not noisy but it's of reduced rank. >> >> > >> >> > >> >> > What I'm thinking is that changing the algorithm, as here, is like >> >> > introducing noise relative to the original, so even if the result is >> >> > perfectly valid, the fitted parameters might change significantly. A >> >> > change >> >> > in the 11-th digit wouldn't surprise me at all, depending on the >> actual >> >> > X >> >> > and rhs. 
>> >> >> >> I believe the errors $e = y - X^+ y$ are in general robust to >> >> differences between valid pinv solutions, even in extreme cases: >> >> >> >> In [2]: print_scalar(sse(X, npl.pinv(X), y)) >> >> 72.0232781366616024 >> >> >> >> In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) >> >> 72.0232781366939463 >> >> >> >> In [4]: npl.cond(np.c_[X, X]) >> >> Out[4]: 6.6247965335098062e+26 >> > >> > >> > >> > I think this does not depend on the smallest eigenvalues which should be >> > regularized away in cases like this. >> > The potential problem is with eigenvalues that are just a but too large >> for >> > rcond to remove them but still have numerical problems. >> > >> > I think it's a similar issue about the threshold as the one that you >> raised >> > some time ago with determining the matrix rank for near singular >> matrices. >> >> My point was (I think you're agreeing) that the accuracy of the pinv >> does not depend in a simple way on the condition number of the input >> matrix. >> > > That's not the point I wanted to make. > The condition number doesn't matter if there is an singular/eigenvalue > with rcond 1e-16 or 1e-30 or 1e-100. > However, the numerical accuracy might depend on the condition number when > there is a singular value with rcond of 2e-15 or 5e-15. > I didn't manage to find the default cond or rcond for scipy.linalg.pinv > and lstsq. The problem I had was with numpy's pinv. > I think there is a grey zone in rcond of eigenvalues that might be bad > numerically. > > (My only experience is with SVD based pinv.) > In the singular values array that you showed earlier, you have three small eigenvalues, but their rcond are only around 1e-11. So from my experience with SVD pinv, there shouldn't be any problem with a pinv in this case. Josef > > Josef > > >> >> I'm not quite sure how to assess the change in accuracy of scipy 0.17 >> on Linux, but it does seem to me to be significantly different (and >> worse than) the other options (like numpy, scipy 0.16) >> >> Matthew >> _______________________________________________ >> SciPy-Dev mailing list >> SciPy-Dev at scipy.org >> https://mail.scipy.org/mailman/listinfo/scipy-dev >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Sat Mar 19 15:31:32 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 19 Mar 2016 12:31:32 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Fri, Mar 18, 2016 at 1:19 PM, wrote: > > > On Fri, Mar 18, 2016 at 4:09 PM, wrote: >> >> >> >> On Fri, Mar 18, 2016 at 3:56 PM, Matthew Brett >> wrote: >>> >>> On Fri, Mar 18, 2016 at 12:19 PM, wrote: >>> > >>> > >>> > On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett >>> > >>> > wrote: >>> >> >>> >> On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris >>> >> wrote: >>> >> > >>> >> > >>> >> > On Wed, Mar 16, 2016 at 9:39 PM, wrote: >>> >> >> >>> >> >> >>> >> >> >>> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris >>> >> >> wrote: >>> >> >>> >>> >> >>> >>> >> >>> >>> >> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >>> >> >>> >>> >> >>> wrote: >>> >> >>>> >>> >> >>>> Hi, >>> >> >>>> >>> >> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >>> >> >>>> wrote: >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >>> >> >>>> > wrote: >>> >> >>>> >> >>> >> >>>> >> >>> >> >>>> >> >>> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >>> >> >>>> >> >>> >> >>>> >> wrote: >>> >> >>>> >>> >>> >> >>>> >>> Hi, >>> >> >>>> >>> >>> >> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >>> >> >>>> >>> >>> >> >>>> >>> wrote: >>> >> >>>> >>> > >>> >> >>>> >>> > >>> >> >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>> >> >>>> >>> > >>> >> >>>> >>> > wrote: >>> >> >>>> >>> >> >>> >> >>>> >>> >> Hi, >>> >> >>>> >>> >> >>> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails >>> >> >>>> >>> >> for >>> >> >>>> >>> >> scipy >>> >> >>>> >>> >> 0.17.0: >>> >> >>>> >>> >> >>> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>> >> >>>> >>> >> >>> >> >>>> >>> >> This turns out to be due to small changes in the result of >>> >> >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the >>> >> >>>> >>> >> same >>> >> >>>> >>> >> platform >>> >> >>>> >>> >> - the travis Ubuntu in this case. >>> >> >>>> >>> >> >>> >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I >>> >> >>>> >>> >> believe >>> >> >>>> >>> >> the >>> >> >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly >>> >> >>>> >>> >> more >>> >> >>>> >>> >> accurate. Is that possible? >>> >> >>>> >>> > >>> >> >>>> >>> > >>> >> >>>> >>> > That's very well possible, because we changed the >>> >> >>>> >>> > underlying >>> >> >>>> >>> > LAPACK >>> >> >>>> >>> > routine >>> >> >>>> >>> > that's used: >>> >> >>>> >>> > >>> >> >>>> >>> > >>> >> >>>> >>> > >>> >> >>>> >>> > >>> >> >>>> >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. >>> >> >>>> >>> > This PR also had some lstsq precision issues, so could be >>> >> >>>> >>> > useful: >>> >> >>>> >>> > https://github.com/scipy/scipy/pull/5550 >>> >> >>>> >>> > >>> >> >>>> >>> > What is the accuracy difference in your case? >>> >> >>>> >>> >>> >> >>>> >>> Damn - I thought you might ask me that! >>> >> >>>> >>> >>> >> >>>> >>> The test where it showed up comes from some calculations on a >>> >> >>>> >>> simple >>> >> >>>> >>> regression problem. >>> >> >>>> >>> >>> >> >>>> >>> The regression model is $y = X B + e$, with ordinary least >>> >> >>>> >>> squares >>> >> >>>> >>> fit >>> >> >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of >>> >> >>>> >>> matrix >>> >> >>>> >>> X. 
>>> >> >>>> >>> >>> >> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using >>> >> >>>> >>> scipy >>> >> >>>> >>> 0.16 >>> >> >>>> >>> pinv, compared to scipy 0.17 pinv: >>> >> >>>> >>> >>> >> >>>> >>> # SSE using high-precision sympy pinv >>> >> >>>> >>> 72.0232781366615313 >>> >> >>>> >>> # SSE using scipy 0.16 pinv >>> >> >>>> >>> 72.0232781366625261 >>> >> >>>> >>> # SSE using scipy 0.17 pinv >>> >> >>>> >>> 72.0232781390480170 >>> >> >>>> >> >>> >> >>>> >> >>> >> >>>> >> That difference looks pretty big, looks almost like a cutoff >>> >> >>>> >> effect. >>> >> >>>> >> What >>> >> >>>> >> is the matrix X, in particular, what is its condition number >>> >> >>>> >> or >>> >> >>>> >> singular >>> >> >>>> >> values? Might also be interesting to see residuals for the >>> >> >>>> >> different >>> >> >>>> >> functions. >>> >> >>>> >> >>> >> >>>> >> >>> >> >>>> >> >>> >> >>>> > >>> >> >>>> > Note that this might be an indication that the result of the >>> >> >>>> > calculation has >>> >> >>>> > an intrinsically large error bar. >>> >> >>>> >>> >> >>>> In [6]: np.linalg.svd(X, compute_uv=False) >>> >> >>>> Out[6]: >>> >> >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >>> >> >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >>> >> >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >>> >> >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >>> >> >>>> 1.72675581e-03, 9.51576795e-04]) >>> >> >>>> >>> >> >>>> In [7]: np.linalg.cond(X) >>> >> >>>> Out[7]: 73278544935.300507 >>> >> >>> >>> >> >>> >>> >> >>> Condition number is about 7e10, so you might expect to lose about >>> >> >>> 10 >>> >> >>> digits of of accuracy just due to roundoff. The covariance of the >>> >> >>> fitting >>> >> >>> error will be `inv(X.T, X) * error`, so you can look at the >>> >> >>> diagonals >>> >> >>> for >>> >> >>> the variance of the fitted parameters just for kicks, taking >>> >> >>> `error` >>> >> >>> to be >>> >> >>> roundoff error or measurement error. This is not to say that the >>> >> >>> results >>> >> >>> should not be more reproducible, but should be of interest for the >>> >> >>> validity >>> >> >>> of the solution. Note that I suspect that there may be a >>> >> >>> difference >>> >> >>> in the >>> >> >>> default `rcond` used, or maybe the matrix is on the edge and a >>> >> >>> small >>> >> >>> change >>> >> >>> in results changes the effective rank. You can run with the >>> >> >>> `return_rank=True` argument to check that. Looks like machine >>> >> >>> precision is >>> >> >>> used for the default value of rcond, so experimenting with that >>> >> >>> probable >>> >> >>> won't be informative, but you might want to try it anyway. >>> >> >> >>> >> >> >>> >> >> >>> >> >> statsmodels is using numpy pinv for the linear model and it works >>> >> >> in >>> >> >> almost all cases very well *). >>> >> >> >>> >> >> Using pinv with singular (given the rcond threshold) X makes >>> >> >> everything >>> >> >> well defined and not subject to numerical noise because of the >>> >> >> regularization induced by pinv. Our covariance for the parameter >>> >> >> estimates, >>> >> >> based on pinv(X), is also not noisy but it's of reduced rank. >>> >> > >>> >> > >>> >> > What I'm thinking is that changing the algorithm, as here, is like >>> >> > introducing noise relative to the original, so even if the result is >>> >> > perfectly valid, the fitted parameters might change significantly. 
A >>> >> > change >>> >> > in the 11-th digit wouldn't surprise me at all, depending on the >>> >> > actual >>> >> > X >>> >> > and rhs. >>> >> >>> >> I believe the errors $e = y - X^+ y$ are in general robust to >>> >> differences between valid pinv solutions, even in extreme cases: >>> >> >>> >> In [2]: print_scalar(sse(X, npl.pinv(X), y)) >>> >> 72.0232781366616024 >>> >> >>> >> In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) >>> >> 72.0232781366939463 >>> >> >>> >> In [4]: npl.cond(np.c_[X, X]) >>> >> Out[4]: 6.6247965335098062e+26 >>> > >>> > >>> > >>> > I think this does not depend on the smallest eigenvalues which should >>> > be >>> > regularized away in cases like this. >>> > The potential problem is with eigenvalues that are just a but too large >>> > for >>> > rcond to remove them but still have numerical problems. >>> > >>> > I think it's a similar issue about the threshold as the one that you >>> > raised >>> > some time ago with determining the matrix rank for near singular >>> > matrices. >>> >>> My point was (I think you're agreeing) that the accuracy of the pinv >>> does not depend in a simple way on the condition number of the input >>> matrix. >> >> >> That's not the point I wanted to make. >> The condition number doesn't matter if there is an singular/eigenvalue >> with rcond 1e-16 or 1e-30 or 1e-100. >> However, the numerical accuracy might depend on the condition number when >> there is a singular value with rcond of 2e-15 or 5e-15. >> I didn't manage to find the default cond or rcond for scipy.linalg.pinv >> and lstsq. The problem I had was with numpy's pinv. >> I think there is a grey zone in rcond of eigenvalues that might be bad >> numerically. >> >> (My only experience is with SVD based pinv.) > > > In the singular values array that you showed earlier, you have three small > eigenvalues, but their rcond are only around 1e-11. > So from my experience with SVD pinv, there shouldn't be any problem with a > pinv in this case. So - is there any downside to recommending numpy.linalg.pinv over scipy.linalg.pinv in general? Cheers, Matthew From josef.pktd at gmail.com Sat Mar 19 15:58:10 2016 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sat, 19 Mar 2016 15:58:10 -0400 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Sat, Mar 19, 2016 at 3:31 PM, Matthew Brett wrote: > On Fri, Mar 18, 2016 at 1:19 PM, wrote: > > > > > > On Fri, Mar 18, 2016 at 4:09 PM, wrote: > >> > >> > >> > >> On Fri, Mar 18, 2016 at 3:56 PM, Matthew Brett > > >> wrote: > >>> > >>> On Fri, Mar 18, 2016 at 12:19 PM, wrote: > >>> > > >>> > > >>> > On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett > >>> > > >>> > wrote: > >>> >> > >>> >> On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris > >>> >> wrote: > >>> >> > > >>> >> > > >>> >> > On Wed, Mar 16, 2016 at 9:39 PM, wrote: > >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris > >>> >> >> wrote: > >>> >> >>> > >>> >> >>> > >>> >> >>> > >>> >> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett > >>> >> >>> > >>> >> >>> wrote: > >>> >> >>>> > >>> >> >>>> Hi, > >>> >> >>>> > >>> >> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris > >>> >> >>>> wrote: > >>> >> >>>> > > >>> >> >>>> > > >>> >> >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris > >>> >> >>>> > wrote: > >>> >> >>>> >> > >>> >> >>>> >> > >>> >> >>>> >> > >>> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett > >>> >> >>>> >> > >>> >> >>>> >> wrote: > >>> >> >>>> >>> > >>> >> >>>> >>> Hi, > >>> >> >>>> >>> > >>> >> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers > >>> >> >>>> >>> > >>> >> >>>> >>> wrote: > >>> >> >>>> >>> > > >>> >> >>>> >>> > > >>> >> >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > >>> >> >>>> >>> > > >>> >> >>>> >>> > wrote: > >>> >> >>>> >>> >> > >>> >> >>>> >>> >> Hi, > >>> >> >>>> >>> >> > >>> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails > >>> >> >>>> >>> >> for > >>> >> >>>> >>> >> scipy > >>> >> >>>> >>> >> 0.17.0: > >>> >> >>>> >>> >> > >>> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 > >>> >> >>>> >>> >> > >>> >> >>>> >>> >> This turns out to be due to small changes in the result > of > >>> >> >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the > >>> >> >>>> >>> >> same > >>> >> >>>> >>> >> platform > >>> >> >>>> >>> >> - the travis Ubuntu in this case. > >>> >> >>>> >>> >> > >>> >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I > >>> >> >>>> >>> >> believe > >>> >> >>>> >>> >> the > >>> >> >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly > >>> >> >>>> >>> >> more > >>> >> >>>> >>> >> accurate. Is that possible? > >>> >> >>>> >>> > > >>> >> >>>> >>> > > >>> >> >>>> >>> > That's very well possible, because we changed the > >>> >> >>>> >>> > underlying > >>> >> >>>> >>> > LAPACK > >>> >> >>>> >>> > routine > >>> >> >>>> >>> > that's used: > >>> >> >>>> >>> > > >>> >> >>>> >>> > > >>> >> >>>> >>> > > >>> >> >>>> >>> > > >>> >> >>>> >>> > > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements > . > >>> >> >>>> >>> > This PR also had some lstsq precision issues, so could be > >>> >> >>>> >>> > useful: > >>> >> >>>> >>> > https://github.com/scipy/scipy/pull/5550 > >>> >> >>>> >>> > > >>> >> >>>> >>> > What is the accuracy difference in your case? > >>> >> >>>> >>> > >>> >> >>>> >>> Damn - I thought you might ask me that! > >>> >> >>>> >>> > >>> >> >>>> >>> The test where it showed up comes from some calculations > on a > >>> >> >>>> >>> simple > >>> >> >>>> >>> regression problem. 
> >>> >> >>>> >>> > >>> >> >>>> >>> The regression model is $y = X B + e$, with ordinary least > >>> >> >>>> >>> squares > >>> >> >>>> >>> fit > >>> >> >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of > >>> >> >>>> >>> matrix > >>> >> >>>> >>> X. > >>> >> >>>> >>> > >>> >> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using > >>> >> >>>> >>> scipy > >>> >> >>>> >>> 0.16 > >>> >> >>>> >>> pinv, compared to scipy 0.17 pinv: > >>> >> >>>> >>> > >>> >> >>>> >>> # SSE using high-precision sympy pinv > >>> >> >>>> >>> 72.0232781366615313 > >>> >> >>>> >>> # SSE using scipy 0.16 pinv > >>> >> >>>> >>> 72.0232781366625261 > >>> >> >>>> >>> # SSE using scipy 0.17 pinv > >>> >> >>>> >>> 72.0232781390480170 > >>> >> >>>> >> > >>> >> >>>> >> > >>> >> >>>> >> That difference looks pretty big, looks almost like a cutoff > >>> >> >>>> >> effect. > >>> >> >>>> >> What > >>> >> >>>> >> is the matrix X, in particular, what is its condition number > >>> >> >>>> >> or > >>> >> >>>> >> singular > >>> >> >>>> >> values? Might also be interesting to see residuals for the > >>> >> >>>> >> different > >>> >> >>>> >> functions. > >>> >> >>>> >> > >>> >> >>>> >> > >>> >> >>>> >> > >>> >> >>>> > > >>> >> >>>> > Note that this might be an indication that the result of the > >>> >> >>>> > calculation has > >>> >> >>>> > an intrinsically large error bar. > >>> >> >>>> > >>> >> >>>> In [6]: np.linalg.svd(X, compute_uv=False) > >>> >> >>>> Out[6]: > >>> >> >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, > >>> >> >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, > >>> >> >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, > >>> >> >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, > >>> >> >>>> 1.72675581e-03, 9.51576795e-04]) > >>> >> >>>> > >>> >> >>>> In [7]: np.linalg.cond(X) > >>> >> >>>> Out[7]: 73278544935.300507 > >>> >> >>> > >>> >> >>> > >>> >> >>> Condition number is about 7e10, so you might expect to lose > about > >>> >> >>> 10 > >>> >> >>> digits of of accuracy just due to roundoff. The covariance of > the > >>> >> >>> fitting > >>> >> >>> error will be `inv(X.T, X) * error`, so you can look at the > >>> >> >>> diagonals > >>> >> >>> for > >>> >> >>> the variance of the fitted parameters just for kicks, taking > >>> >> >>> `error` > >>> >> >>> to be > >>> >> >>> roundoff error or measurement error. This is not to say that the > >>> >> >>> results > >>> >> >>> should not be more reproducible, but should be of interest for > the > >>> >> >>> validity > >>> >> >>> of the solution. Note that I suspect that there may be a > >>> >> >>> difference > >>> >> >>> in the > >>> >> >>> default `rcond` used, or maybe the matrix is on the edge and a > >>> >> >>> small > >>> >> >>> change > >>> >> >>> in results changes the effective rank. You can run with the > >>> >> >>> `return_rank=True` argument to check that. Looks like machine > >>> >> >>> precision is > >>> >> >>> used for the default value of rcond, so experimenting with that > >>> >> >>> probable > >>> >> >>> won't be informative, but you might want to try it anyway. > >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> statsmodels is using numpy pinv for the linear model and it works > >>> >> >> in > >>> >> >> almost all cases very well *). > >>> >> >> > >>> >> >> Using pinv with singular (given the rcond threshold) X makes > >>> >> >> everything > >>> >> >> well defined and not subject to numerical noise because of the > >>> >> >> regularization induced by pinv. 
Our covariance for the parameter > >>> >> >> estimates, > >>> >> >> based on pinv(X), is also not noisy but it's of reduced rank. > >>> >> > > >>> >> > > >>> >> > What I'm thinking is that changing the algorithm, as here, is like > >>> >> > introducing noise relative to the original, so even if the result > is > >>> >> > perfectly valid, the fitted parameters might change > significantly. A > >>> >> > change > >>> >> > in the 11-th digit wouldn't surprise me at all, depending on the > >>> >> > actual > >>> >> > X > >>> >> > and rhs. > >>> >> > >>> >> I believe the errors $e = y - X^+ y$ are in general robust to > >>> >> differences between valid pinv solutions, even in extreme cases: > >>> >> > >>> >> In [2]: print_scalar(sse(X, npl.pinv(X), y)) > >>> >> 72.0232781366616024 > >>> >> > >>> >> In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) > >>> >> 72.0232781366939463 > >>> >> > >>> >> In [4]: npl.cond(np.c_[X, X]) > >>> >> Out[4]: 6.6247965335098062e+26 > >>> > > >>> > > >>> > > >>> > I think this does not depend on the smallest eigenvalues which should > >>> > be > >>> > regularized away in cases like this. > >>> > The potential problem is with eigenvalues that are just a but too > large > >>> > for > >>> > rcond to remove them but still have numerical problems. > >>> > > >>> > I think it's a similar issue about the threshold as the one that you > >>> > raised > >>> > some time ago with determining the matrix rank for near singular > >>> > matrices. > >>> > >>> My point was (I think you're agreeing) that the accuracy of the pinv > >>> does not depend in a simple way on the condition number of the input > >>> matrix. > >> > >> > >> That's not the point I wanted to make. > >> The condition number doesn't matter if there is an singular/eigenvalue > >> with rcond 1e-16 or 1e-30 or 1e-100. > >> However, the numerical accuracy might depend on the condition number > when > >> there is a singular value with rcond of 2e-15 or 5e-15. > >> I didn't manage to find the default cond or rcond for scipy.linalg.pinv > >> and lstsq. The problem I had was with numpy's pinv. > >> I think there is a grey zone in rcond of eigenvalues that might be bad > >> numerically. > >> > >> (My only experience is with SVD based pinv.) > > > > > > In the singular values array that you showed earlier, you have three > small > > eigenvalues, but their rcond are only around 1e-11. > > So from my experience with SVD pinv, there shouldn't be any problem with > a > > pinv in this case. > > So - is there any downside to recommending numpy.linalg.pinv over > scipy.linalg.pinv in general? > The only time that I did a comparison was several years ago with the ill conditioned NIST test case. Some of the scipy pinv where a tiny bit more accurate than the numpy pinv. Smallest singular or eigenvalues were zero +/- noise. My impression was that in edge cases where numerical noise matters, some implementation details or which linalg library is used can have a visible effect. (I still have never read a numerical linear algebra book, and my "incompetence" from a thread many years ago is only tempered by 7 more years of usage.) I'm pretty happy with using np.linalg.pinv, although there are also a few cases of noise results. The main reason I stayed with numpy's pinv for statsmodels was that it was quite a bit faster than scipy pinv, partially because it doesn't do the isfinite check and maybe for some other algorithmic reasons. I haven't compared the two since then, and scipy's pinv has changed since then. 
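If someone wants to redo that comparison, a rough sketch would be
something like the following (X is only a placeholder for an
ill-conditioned design matrix; the exact differences and the timings
will depend on which LAPACK numpy and scipy are linked against):

import numpy as np
import numpy.linalg as npl
import scipy.linalg as spl

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 14))
X[:, -1] = X[:, 0] + 1e-9 * rng.normal(size=200)  # nearly collinear column

p_np = npl.pinv(X)     # SVD based
p_sp = spl.pinv(X)     # least-squares (lstsq) based
p_sp2 = spl.pinv2(X)   # SVD based, analogous to numpy's pinv

print(npl.cond(X))
print(np.abs(p_np - p_sp).max())
print(np.abs(p_np - p_sp2).max())

and %timeit npl.pinv(X) versus %timeit spl.pinv(X) in IPython would show
whether the speed difference is still there.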
statsmodels unit test for these kind of good problems are usually in the rtol=1e-12 to 1e-13 range. Going up to 1e-14 usually failed on some machine during release testing. Problems that are not so nice have lower precision. Josef > > Cheers, > > Matthew > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Sat Mar 19 19:27:03 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 19 Mar 2016 17:27:03 -0600 Subject: [SciPy-Dev] Numpy 1.11.0rc2 released. Message-ID: Hi All, Numpy 1.11.0rc2 has been released. There have been a few reversions and fixes since 1.11.0rc1, but by and large things have been pretty stable and unless something terrible happens, I will make the final next weekend. Source files and documentation can be found on Sourceforge , while source files and OS X wheels for Python 2.7, 3.3, 3.4, and 3.5 can be installed from Pypi. Please test and yell loudly if there is a problem. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Mar 21 04:11:19 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 21 Mar 2016 09:11:19 +0100 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: numpy.linalg.pinv uses the same solver as scipy.linalg.pinv2 (SVD) while scipy.linalg.pinv uses the least squares solver. -- Olivier From david.mikolas1 at gmail.com Mon Mar 21 11:54:13 2016 From: david.mikolas1 at gmail.com (David Mikolas) Date: Mon, 21 Mar 2016 23:54:13 +0800 Subject: [SciPy-Dev] minimize reporting successful but not searching (with default method) Message-ID: I have a situation where scipy.optimize.minimize returns the initial starting value as the answer without really iterating, and returns "successful" status, using default method. It works fine if I specify Nelder-Mead, but my question is "why does minimize return successful when it isn's, and what can I do (if anything) to flag this behavior? I originally posted in scipy-user but decided it may be more appropriate here. The question is also here http://stackoverflow.com/q/36110998 and if someone can also leave an answer there as well, that would be great! but not necessary. One particular - I'm using calls to Skyfield, which may or may not have something to do with it. (http://rhodesmill.org/skyfield/) FULL OUTPUT using DEFAULT METHOD: status: 0 success: True njev: 1 nfev: 3 hess_inv: array([[1]]) fun: 1694.98753895812 x: array([ 10000.]) message: 'Optimization terminated successfully.' jac: array([ 0.]) nit: 0 FULL OUTPUT using Nelder-Mead METHOD: status: 0 nfev: 63 success: True fun: 3.2179306044608054 x: array([ 13053.81011963]) message: 'Optimization terminated successfully.' nit: 28 -------------- next part -------------- An HTML attachment was scrubbed... URL: From pav at iki.fi Mon Mar 21 17:39:47 2016 From: pav at iki.fi (Pauli Virtanen) Date: Mon, 21 Mar 2016 21:39:47 +0000 (UTC) Subject: [SciPy-Dev] minimize reporting successful but not searching (with default method) References: Message-ID: Mon, 21 Mar 2016 23:54:13 +0800, David Mikolas kirjoitti: > It works fine if I specify Nelder-Mead, but my question is "why does > minimize return successful when it isn's Your function is piecewise constant ("staircase" pattern on a small scale). 
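A toy function with the same small-scale structure shows the same
behaviour (a hypothetical stand-in, not your actual Skyfield code):

import numpy as np
from scipy.optimize import minimize

STEP = 1e-4  # stair width, chosen larger than the change of the smooth
             # part over the default numerical-differentiation step ~1.5e-8

def f(x):
    smooth = (x[0] - 1.234) ** 2 + 0.05
    return np.floor(smooth / STEP) * STEP  # quantize -> piecewise constant

print(minimize(f, [3.0]))                        # default (BFGS): nit=0, "success"
print(minimize(f, [3.0], method='Nelder-Mead'))  # ends up near x ~ 1.23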
The derivative at the initial point is zero. The optimization method is a local optimizer. Therefore: the initial point is a local optimum (in the sense understood by the optimizer i.e. satisfying the KKT conditions). *** As your function is not differentiable, the KKT conditions don't mean much. Note that it works with Nelder-Mead is sort of a coincidence --- it works because the "staircase" pattern of the function is on a scale smaller than the initial simplex size. On the other hand, it fails for BFGS, because the default step size used for numerical differentiation happens to be smaller than the size of the "staircase" size. > and what can I do (if anything) to flag this behavior? Know whether your function is differentiable or not, and which optimization methods are expected to work. There's no cheap general way of detecting --- based on the numerical floating-point values output by a function --- whether a function is everywhere differentiable or continuous. This information must come from the author of the function, and optimization methods can only assume it. You can however make sanity checks, e.g., using different initial values to see if you're stuck at a local minimum. If the function is not differentiable, optimizers that use derivatives, even numerically approximated, can be unreliable. If the non-differentiability is on a small scale, you can cheat and choose a numerical differentiation step size large enough --- and hope for the best. -- Pauli Virtanen From andyfaff at gmail.com Mon Mar 21 21:00:32 2016 From: andyfaff at gmail.com (Andrew Nelson) Date: Tue, 22 Mar 2016 12:00:32 +1100 Subject: [SciPy-Dev] minimize reporting successful but not searching (with default method) In-Reply-To: References: Message-ID: If you have a rough idea of lower/upper bounds for the parameters then you could use differential_evolution, it uses a stochastic rather than gradient approach. It will require many more function evaluations though. On 22 March 2016 at 08:39, Pauli Virtanen wrote: > Mon, 21 Mar 2016 23:54:13 +0800, David Mikolas kirjoitti: > > It works fine if I specify Nelder-Mead, but my question is "why does > > minimize return successful when it isn's > > Your function is piecewise constant ("staircase" pattern on a small > scale). > > The derivative at the initial point is zero. > > The optimization method is a local optimizer. > > Therefore: the initial point is a local optimum (in the sense understood > by the optimizer i.e. satisfying the KKT conditions). > > *** > > As your function is not differentiable, the KKT conditions don't mean > much. > > Note that it works with Nelder-Mead is sort of a coincidence --- it works > because the "staircase" pattern of the function is on a scale smaller > than the initial simplex size. > > On the other hand, it fails for BFGS, because the default step size used > for numerical differentiation happens to be smaller than the size of the > "staircase" size. > > > and what can I do (if anything) to flag this behavior? > > Know whether your function is differentiable or not, and which > optimization methods are expected to work. > > There's no cheap general way of detecting --- based on the numerical > floating-point values output by a function --- whether a function is > everywhere differentiable or continuous. This information must come from > the author of the function, and optimization methods can only assume it. > > You can however make sanity checks, e.g., using different initial values > to see if you're stuck at a local minimum. 
> > If the function is not differentiable, optimizers that use derivatives, > even numerically approximated, can be unreliable. > > If the non-differentiability is on a small scale, you can cheat and choose > a numerical differentiation step size large enough --- and hope for the > best. > > -- > Pauli Virtanen > > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -- _____________________________________ Dr. Andrew Nelson _____________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.mikolas1 at gmail.com Mon Mar 21 21:46:21 2016 From: david.mikolas1 at gmail.com (David Mikolas) Date: Tue, 22 Mar 2016 09:46:21 +0800 Subject: [SciPy-Dev] minimize reporting successful but not searching (with default method) In-Reply-To: References: Message-ID: Pauli thank you for taking the time to explain clearly and thoroughly. I've updated the question in stackexchange by removing words like "wrong" and "fail", and I've added a supplementary answer with a plot of the staircase behavior of the JulianDate method. http://stackoverflow.com/a/36144582 http://i.stack.imgur.com/CUuia.png That's 40 microseconds on a span of thousands of years. On Tue, Mar 22, 2016 at 5:39 AM, Pauli Virtanen wrote: > Mon, 21 Mar 2016 23:54:13 +0800, David Mikolas kirjoitti: > > It works fine if I specify Nelder-Mead, but my question is "why does > > minimize return successful when it isn's > > Your function is piecewise constant ("staircase" pattern on a small > scale). > > The derivative at the initial point is zero. > > The optimization method is a local optimizer. > > Therefore: the initial point is a local optimum (in the sense understood > by the optimizer i.e. satisfying the KKT conditions). > > *** > > As your function is not differentiable, the KKT conditions don't mean > much. > > Note that it works with Nelder-Mead is sort of a coincidence --- it works > because the "staircase" pattern of the function is on a scale smaller > than the initial simplex size. > > On the other hand, it fails for BFGS, because the default step size used > for numerical differentiation happens to be smaller than the size of the > "staircase" size. > > > and what can I do (if anything) to flag this behavior? > > Know whether your function is differentiable or not, and which > optimization methods are expected to work. > > There's no cheap general way of detecting --- based on the numerical > floating-point values output by a function --- whether a function is > everywhere differentiable or continuous. This information must come from > the author of the function, and optimization methods can only assume it. > > You can however make sanity checks, e.g., using different initial values > to see if you're stuck at a local minimum. > > If the function is not differentiable, optimizers that use derivatives, > even numerically approximated, can be unreliable. > > If the non-differentiability is on a small scale, you can cheat and choose > a numerical differentiation step size large enough --- and hope for the > best. > > -- > Pauli Virtanen > > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From david.mikolas1 at gmail.com Mon Mar 21 21:59:50 2016 From: david.mikolas1 at gmail.com (David Mikolas) Date: Tue, 22 Mar 2016 09:59:50 +0800 Subject: [SciPy-Dev] minimize reporting successful but not searching (with default method) In-Reply-To: References: Message-ID: Andrew, Differential evolution is new to me and therefore by definition interesting, thanks! In this particular case I definitely want the local minimum (this eclipse), and calls are expensive - there's a millisecond delay for each new instance of Julian Date arrays http://stackoverflow.com/q/35358401 But if I access the database directly, then this could be very interesting to try to search for new events! On Tue, Mar 22, 2016 at 9:00 AM, Andrew Nelson wrote: > If you have a rough idea of lower/upper bounds for the parameters then you > could use differential_evolution, it uses a stochastic rather than gradient > approach. It will require many more function evaluations though. > > > On 22 March 2016 at 08:39, Pauli Virtanen wrote: > >> Mon, 21 Mar 2016 23:54:13 +0800, David Mikolas kirjoitti: >> > It works fine if I specify Nelder-Mead, but my question is "why does >> > minimize return successful when it isn's >> >> Your function is piecewise constant ("staircase" pattern on a small >> scale). >> >> The derivative at the initial point is zero. >> >> The optimization method is a local optimizer. >> >> Therefore: the initial point is a local optimum (in the sense understood >> by the optimizer i.e. satisfying the KKT conditions). >> >> *** >> >> As your function is not differentiable, the KKT conditions don't mean >> much. >> >> Note that it works with Nelder-Mead is sort of a coincidence --- it works >> because the "staircase" pattern of the function is on a scale smaller >> than the initial simplex size. >> >> On the other hand, it fails for BFGS, because the default step size used >> for numerical differentiation happens to be smaller than the size of the >> "staircase" size. >> >> > and what can I do (if anything) to flag this behavior? >> >> Know whether your function is differentiable or not, and which >> optimization methods are expected to work. >> >> There's no cheap general way of detecting --- based on the numerical >> floating-point values output by a function --- whether a function is >> everywhere differentiable or continuous. This information must come from >> the author of the function, and optimization methods can only assume it. >> >> You can however make sanity checks, e.g., using different initial values >> to see if you're stuck at a local minimum. >> >> If the function is not differentiable, optimizers that use derivatives, >> even numerically approximated, can be unreliable. >> >> If the non-differentiability is on a small scale, you can cheat and choose >> a numerical differentiation step size large enough --- and hope for the >> best. >> >> -- >> Pauli Virtanen >> >> _______________________________________________ >> SciPy-Dev mailing list >> SciPy-Dev at scipy.org >> https://mail.scipy.org/mailman/listinfo/scipy-dev >> > > > > -- > _____________________________________ > Dr. Andrew Nelson > > > _____________________________________ > > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charlesr.harris at gmail.com Sun Mar 27 19:32:46 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sun, 27 Mar 2016 17:32:46 -0600 Subject: [SciPy-Dev] Numpy 1.11.0 released Message-ID: Hi All, I'm pleased to announce the release of Numpy 1.11.0. This release is the result of six months of work comprising 371 merged pull requests submitted by 112 authors containing many bug fixes and improvements. Highlights are: - The datetime64 type is now timezone naive. - A dtype parameter has been added to ``randint``. - Improved detection of two arrays possibly sharing memory. - Automatic bin size estimation for ``np.histogram``. - Speed optimization of A @ A.T and dot(A, A.T). - New function ``np.moveaxis`` for reordering array axes. Source files and documentation can be found on Sourceforge , while source files and OS X wheels for Python 2.7, 3.3, 3.4, and 3.5 can be installed from Pypi. Note that this is the last release to support Python 2.6, 3.2, and 3.3. Contributors are listed below in alphabetical order with the names of new contributors starred. Abdullah Alrasheed* Aditya Panchal* Alain* Alex Griffing Alex Rogozhnikov* Alex Willmer Alexander Heger* Allan Haldane Anatoly Techtonik* Andrew Nelson Anne Archibald Antoine Pitrou Antony Lee* Behzad Nouri Bertrand Lefebvre Blake Griffith Boxiang Sun* Brigitta Sipocz* Carl Kleffner Charles Harris Chris Hogan* Christoph Gohlke Colin Jermain* Cong Ma* Daniel David Freese David Sanders* Dmitry Odzerikho* Dmitry Zagorny* Eric Larson* Eric Moore Ethan Kruse* Eugene Krokhalev* Evgeni Burovski* Francis T. O'Donovan* Fran?ois Boulogne* Gabi Davar* Gerrit Holl Gopal Singh Meena* Greg Yang* Greg Young* Gregory R. Lee Griffin Hosseinzadeh* Hassan Kibirige* Holger Kohr* Ian Henriksen Iceman9* Jaime Fernandez James Camel* Jason King* John Bjorn Nelson* John Kirkham Jonathan Helmus Jonathan Underwood* Joseph Fox-Rabinovitz* Julian Taylor Julien Dubois* Julien Lhermitte* Julien Schueller* J?rn Hees* Konstantinos Psychas* Lars Buitinck Luke Zoltan Kelley* MaPePeR* Mad Physicist* Mark Wiebe Marten van Kerkwijk Matthew Brett Matthias Geier* Maximilian Trescher* Michael K. Tran* Michael Behrisch* Michael Currie* Michael L?ffler* Nathaniel Hellabyte* Nathaniel J. Smith Nick Papior* Nicolas Calle* Nicol?s Della Penna* Olivier Grisel Pauli Virtanen Peter Iannucci Phaiax* Ralf Gommers Rehas Sachdeva* Ronan Lamy Ruediger Meier* Ryan Grout* Ryosuke Okuta* R?my L?one* Sam Radhakrishnan* Samuel St-Jean* Sebastian Berg Simon Conseil* Stephan Hoyer Stuart Archibald* Stuart Berg Sumith* Tapasweni Pathak* Thomas Robitaille Tobias Megies* Tushar Gautam* Varun Nayyar* Vincent Legoll* Warren Weckesser Wendell Smith Yash Mehrotra* Yifan Li* endolith floatingpointstack* ldoddema* yolanda15* Thanks to all who worked on this release. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From evgeny.burovskiy at gmail.com Sun Mar 27 19:44:19 2016 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Mon, 28 Mar 2016 00:44:19 +0100 Subject: [SciPy-Dev] Numpy 1.11.0 released In-Reply-To: References: Message-ID: Thank you for doing this Chuck. 28.03.2016 2:32 ???????????? "Charles R Harris" ???????: > Hi All, > > I'm pleased to announce the release of Numpy 1.11.0. This release is the > result of six months of work comprising 371 merged pull requests submitted > by 112 authors containing many bug fixes and improvements. Highlights are: > > - The datetime64 type is now timezone naive. 
> - A dtype parameter has been added to ``randint``. > - Improved detection of two arrays possibly sharing memory. > - Automatic bin size estimation for ``np.histogram``. > - Speed optimization of A @ A.T and dot(A, A.T). > - New function ``np.moveaxis`` for reordering array axes. > > Source files and documentation can be found on Sourceforge > , while > source files and OS X wheels for Python 2.7, 3.3, 3.4, and 3.5 can be > installed from Pypi. Note that this is the last release to support Python > 2.6, 3.2, and 3.3. > > Contributors are listed below in alphabetical order with the names of new > contributors starred. > > Abdullah Alrasheed* > Aditya Panchal* > Alain* > Alex Griffing > Alex Rogozhnikov* > Alex Willmer > Alexander Heger* > Allan Haldane > Anatoly Techtonik* > Andrew Nelson > Anne Archibald > Antoine Pitrou > Antony Lee* > Behzad Nouri > Bertrand Lefebvre > Blake Griffith > Boxiang Sun* > Brigitta Sipocz* > Carl Kleffner > Charles Harris > Chris Hogan* > Christoph Gohlke > Colin Jermain* > Cong Ma* > Daniel > David Freese > David Sanders* > Dmitry Odzerikho* > Dmitry Zagorny* > Eric Larson* > Eric Moore > Ethan Kruse* > Eugene Krokhalev* > Evgeni Burovski* > Francis T. O'Donovan* > Fran?ois Boulogne* > Gabi Davar* > Gerrit Holl > Gopal Singh Meena* > Greg Yang* > Greg Young* > Gregory R. Lee > Griffin Hosseinzadeh* > Hassan Kibirige* > Holger Kohr* > Ian Henriksen > Iceman9* > Jaime Fernandez > James Camel* > Jason King* > John Bjorn Nelson* > John Kirkham > Jonathan Helmus > Jonathan Underwood* > Joseph Fox-Rabinovitz* > Julian Taylor > Julien Dubois* > Julien Lhermitte* > Julien Schueller* > J?rn Hees* > Konstantinos Psychas* > Lars Buitinck > Luke Zoltan Kelley* > MaPePeR* > Mad Physicist* > Mark Wiebe > Marten van Kerkwijk > Matthew Brett > Matthias Geier* > Maximilian Trescher* > Michael K. Tran* > Michael Behrisch* > Michael Currie* > Michael L?ffler* > Nathaniel Hellabyte* > Nathaniel J. Smith > Nick Papior* > Nicolas Calle* > Nicol?s Della Penna* > Olivier Grisel > Pauli Virtanen > Peter Iannucci > Phaiax* > Ralf Gommers > Rehas Sachdeva* > Ronan Lamy > Ruediger Meier* > Ryan Grout* > Ryosuke Okuta* > R?my L?one* > Sam Radhakrishnan* > Samuel St-Jean* > Sebastian Berg > Simon Conseil* > Stephan Hoyer > Stuart Archibald* > Stuart Berg > Sumith* > Tapasweni Pathak* > Thomas Robitaille > Tobias Megies* > Tushar Gautam* > Varun Nayyar* > Vincent Legoll* > Warren Weckesser > Wendell Smith > Yash Mehrotra* > Yifan Li* > endolith > floatingpointstack* > ldoddema* > yolanda15* > > Thanks to all who worked on this release. > > Chuck > > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From evgeny.burovskiy at gmail.com Mon Mar 28 06:29:26 2016 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Mon, 28 Mar 2016 11:29:26 +0100 Subject: [SciPy-Dev] RFC: sparse DOK array Message-ID: Hi, TL;DR: Here's a draft implementation of a 'sparse array', suitable for incremental assembly. Comments welcome! https://github.com/ev-br/sparr Scipy sparse matrices are generally great. One area where they are not very great is the formats suitable for incremental assembly (esp LIL and DOK matrices). 
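For concreteness, the assembly pattern in question is roughly the following (a made-up finite-difference-style loop, not code from any existing PR); it is the per-item __setitem__ cost inside this loop that a C-level dictionary-of-keys structure would address:

from scipy import sparse

# Build the matrix one entry at a time, then convert once for fast arithmetic.
n = 1000
A = sparse.dok_matrix((n, n))
for i in range(n - 1):
    A[i, i] += 2.0
    A[i, i + 1] += -1.0
    A[i + 1, i] += -1.0
A[n - 1, n - 1] += 2.0
A = A.tocsr()
print(A.nnz)    # 2998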
This has been discussed for a while, see, for instance, https://github.com/scipy/scipy/pull/2908#discussion_r16955546 I spent some time to see how hard is it to build a dictionary-of-keys data structure without Python overhead. Turns out, not very hard. Also, when constructing a sparse data structure now, it seems to make sense to make it a sparse *array* rather than *matrix*, so that it supports both elementwise and matrix multiplication. A draft implementation is available at https://github.com/ev-br/sparr and I'd be grateful for any comments, large or small. Some usage examples are in the README file which Github shows at the front page above. First and foremost, I'd like to gauge interest in the community ;-). Does it actually make sense? Would you use such a data structure? What is missing in the current version? Short to medium term, some issues I see are: * 32-bit vs 64-bit indices. Scipy sparse matrices switch between index types. I wonder if this is purely backwards compatibility. Naively, it seems to me that a new class could just always use 64-bit indices, but this might be too naive? * Data types and casting rules. For now, I basically piggy-back on numpy's rules. There are several slightly different ones (numba has one?), and there might be an opportunity to simplify the rules. OTOH, inventing one more subtly different set of rules might be a bad idea. * "Object" dtype. So far, there isn't one. I wonder if it's needed or having only numeric types would be enough. * Interoperation with numpy arrays and other sparse matrices. I guess __numpy_ufunc__ would *the* solution here, when available. For now, I do something simple based on special-casing and __array_priority__. Sparse matrices almost work, but there are glitches. This is just a non-exhaustive list of things I'm thinking about. If there's something else (and I'm sure there is plenty!) I'm all ears --- either in this email thread or in the issue tracker. If someone feels like commenting on implementation details, sure, none of those is cast in stone and plumbing is definitely in a bit of a flux. Cheers, Evgeni From shoyer at gmail.com Mon Mar 28 12:13:25 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 28 Mar 2016 09:13:25 -0700 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: On Mon, Mar 28, 2016 at 3:29 AM, Evgeni Burovski wrote: > First and foremost, I'd like to gauge interest in the community ;-). > Does it actually make sense? Would you use such a data structure? What is > missing in the current version? > This looks awesome, and makes complete sense to me! In particular, xarray could really use an n-dimensional sparse structure. A few other things small things I'd like to see: - Support for slicing, even if it's expensive. - A strict way to set the shape without automatic expansion, if desired (e.g., if shape is provided in the constructor). - Default to the dtype of the fill_value. NumPy does this for np.full. > Short to medium term, some issues I see are: > > * 32-bit vs 64-bit indices. Scipy sparse matrices switch between index > types. > I wonder if this is purely backwards compatibility. Naively, it seems to me > that a new class could just always use 64-bit indices, but this might > be too naive? > > * Data types and casting rules. For now, I basically piggy-back on > numpy's rules. > There are several slightly different ones (numba has one?), and there > might be > an opportunity to simplify the rules. 
OTOH, inventing one more subtly > different > set of rules might be a bad idea. > Yes, please follow NumPy. * "Object" dtype. So far, there isn't one. I wonder if it's needed or having > only numeric types would be enough. > This would be marginally useful -- eventually someone is going to want to store some strings in a sparse array, and NumPy doesn't handle this very well. Thus pandas, h5py and xarray all end up using dtype=object for variable length strings. (pandas/xarray even take the monstrous approach of using np.nan as a sentinel missing value.) > * Interoperation with numpy arrays and other sparse matrices. I guess > __numpy_ufunc__ would *the* solution here, when available. > For now, I do something simple based on special-casing and > __array_priority__. > Sparse matrices almost work, but there are glitches. > Yes, __array_priority__ is about the best we can do now. You could actually use a mix of __array_prepare__ and __array_wrap__ to make (non-generalized) ufuncs work, e.g., for functions like np.sin: - In __array_prepare__, return the non-fill values of the array concatenated with the fill value. - In __array_wrap__, reshape all but the last element to build a new sparse array, using the last element for the new fill value. This would be a neat trick and get you most of what you could hope for from __numpy_ufunc__. -------------- next part -------------- An HTML attachment was scrubbed... URL: From evgeny.burovskiy at gmail.com Mon Mar 28 12:19:12 2016 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Mon, 28 Mar 2016 17:19:12 +0100 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: For completeness I'd mention an alternative: https://github.com/scipy/scipy/pull/6004 This is a completely independent effort framed better for the scipy.sparse framework. On Mon, Mar 28, 2016 at 11:29 AM, Evgeni Burovski wrote: > Hi, > > TL;DR: Here's a draft implementation of a 'sparse array', suitable for > incremental > assembly. Comments welcome! > https://github.com/ev-br/sparr > > Scipy sparse matrices are generally great. One area where they are not > very great > is the formats suitable for incremental assembly (esp LIL and DOK > matrices). This > has been discussed for a while, see, for instance, > https://github.com/scipy/scipy/pull/2908#discussion_r16955546 > > I spent some time to see how hard is it to build a dictionary-of-keys > data structure without Python overhead. Turns out, not very hard. > > Also, when constructing a sparse data structure now, it seems to make sense > to make it a sparse *array* rather than *matrix*, so that it supports both > elementwise and matrix multiplication. > > A draft implementation is available at https://github.com/ev-br/sparr and > I'd be grateful for any comments, large or small. > > Some usage examples are in the README file which Github shows at the > front page above. > > First and foremost, I'd like to gauge interest in the community ;-). > Does it actually make sense? Would you use such a data structure? What is > missing in the current version? > > Short to medium term, some issues I see are: > > * 32-bit vs 64-bit indices. Scipy sparse matrices switch between index types. > I wonder if this is purely backwards compatibility. Naively, it seems to me > that a new class could just always use 64-bit indices, but this might > be too naive? > > * Data types and casting rules. For now, I basically piggy-back on > numpy's rules. 
> There are several slightly different ones (numba has one?), and there might be > an opportunity to simplify the rules. OTOH, inventing one more subtly different > set of rules might be a bad idea. > > * "Object" dtype. So far, there isn't one. I wonder if it's needed or having > only numeric types would be enough. > > * Interoperation with numpy arrays and other sparse matrices. I guess > __numpy_ufunc__ would *the* solution here, when available. > For now, I do something simple based on special-casing and __array_priority__. > Sparse matrices almost work, but there are glitches. > > This is just a non-exhaustive list of things I'm thinking about. If there's > something else (and I'm sure there is plenty!) I'm all ears --- either > in this email thread or in the issue tracker. > If someone feels like commenting on implementation details, sure, none of those > is cast in stone and plumbing is definitely in a bit of a flux. > > Cheers, > > Evgeni From perimosocordiae at gmail.com Mon Mar 28 12:37:26 2016 From: perimosocordiae at gmail.com (CJ Carey) Date: Mon, 28 Mar 2016 11:37:26 -0500 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: Another alternative: https://github.com/perimosocordiae/sparray I tried to follow the numpy API as closely as I could, to see if something like this could work. The code isn't particularly optimized, but it does beat CSR/CSC spmatrix formats on a few benchmarks. My plan going forward is to have swappable (opaque) backends to allow for tradeoffs in speed/memory for access/construction/math operations. Apologies for the lacking documentation! The only good listing of current capabilities at the moment is the test suite ( https://github.com/perimosocordiae/sparray/tree/master/sparray/tests). On Mon, Mar 28, 2016 at 11:19 AM, Evgeni Burovski < evgeny.burovskiy at gmail.com> wrote: > For completeness I'd mention an alternative: > https://github.com/scipy/scipy/pull/6004 > This is a completely independent effort framed better for the > scipy.sparse framework. > > > On Mon, Mar 28, 2016 at 11:29 AM, Evgeni Burovski > wrote: > > Hi, > > > > TL;DR: Here's a draft implementation of a 'sparse array', suitable for > > incremental > > assembly. Comments welcome! > > https://github.com/ev-br/sparr > > > > Scipy sparse matrices are generally great. One area where they are not > > very great > > is the formats suitable for incremental assembly (esp LIL and DOK > > matrices). This > > has been discussed for a while, see, for instance, > > https://github.com/scipy/scipy/pull/2908#discussion_r16955546 > > > > I spent some time to see how hard is it to build a dictionary-of-keys > > data structure without Python overhead. Turns out, not very hard. > > > > Also, when constructing a sparse data structure now, it seems to make > sense > > to make it a sparse *array* rather than *matrix*, so that it supports > both > > elementwise and matrix multiplication. > > > > A draft implementation is available at https://github.com/ev-br/sparr > and > > I'd be grateful for any comments, large or small. > > > > Some usage examples are in the README file which Github shows at the > > front page above. > > > > First and foremost, I'd like to gauge interest in the community ;-). > > Does it actually make sense? Would you use such a data structure? What is > > missing in the current version? > > > > Short to medium term, some issues I see are: > > > > * 32-bit vs 64-bit indices. Scipy sparse matrices switch between index > types. 
> > I wonder if this is purely backwards compatibility. Naively, it seems to > me > > that a new class could just always use 64-bit indices, but this might > > be too naive? > > > > * Data types and casting rules. For now, I basically piggy-back on > > numpy's rules. > > There are several slightly different ones (numba has one?), and there > might be > > an opportunity to simplify the rules. OTOH, inventing one more subtly > different > > set of rules might be a bad idea. > > > > * "Object" dtype. So far, there isn't one. I wonder if it's needed or > having > > only numeric types would be enough. > > > > * Interoperation with numpy arrays and other sparse matrices. I guess > > __numpy_ufunc__ would *the* solution here, when available. > > For now, I do something simple based on special-casing and > __array_priority__. > > Sparse matrices almost work, but there are glitches. > > > > This is just a non-exhaustive list of things I'm thinking about. If > there's > > something else (and I'm sure there is plenty!) I'm all ears --- either > > in this email thread or in the issue tracker. > > If someone feels like commenting on implementation details, sure, none > of those > > is cast in stone and plumbing is definitely in a bit of a flux. > > > > Cheers, > > > > Evgeni > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From evgeny.burovskiy at gmail.com Mon Mar 28 17:37:51 2016 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Mon, 28 Mar 2016 22:37:51 +0100 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: Thanks Stephan, > A few other things small things I'd like to see: > - Support for slicing, even if it's expensive. Slicing is on the TODO list. Only needs a bit of plumbing work. > - A strict way to set the shape without automatic expansion, if desired > (e.g., if shape is provided in the constructor). You can set the initial shape in the constructor. Or you mean a flag to freeze the shape once it's set and have `__setitem__(out-of-bounds)` raise an error? > - Default to the dtype of the fill_value. NumPy does this for np.full. Thanks for the suggestion --- done and implemented! >> * Data types and casting rules. For now, I basically piggy-back on >> numpy's rules. >> There are several slightly different ones (numba has one?), and there >> might be >> an opportunity to simplify the rules. OTOH, inventing one more subtly >> different >> set of rules might be a bad idea. > > > Yes, please follow NumPy. One thing I'm wondering is the numpy rule that scalars never upcast arrays. Is it something people actually rely on? [In my experience, I only had to work around it, but my experience might be singular.] > You could actually use a mix of __array_prepare__ and __array_wrap__ to make > (non-generalized) ufuncs work, e.g., for functions like np.sin: > > - In __array_prepare__, return the non-fill values of the array concatenated > with the fill value. > - In __array_wrap__, reshape all but the last element to build a new sparse > array, using the last element for the new fill value. > > This would be a neat trick and get you most of what you could hope for from > __numpy_ufunc__. This is really neat indeed! 
I've flagged it in https://github.com/ev-br/sparr/issues/35 At the moment, I'm dealing with something much less cool, which I suspect I'm not the first one: given m a MapArray and csr a scipy.sparse matrix, - m * csr produces a MapArray holding the result of the elementwise multiplication, but - csr * m fails with the dimension mismatch error when dimensions are OK for elementwise multiply but not matrix multiply. The failure is somewhere in the scipy.sparse code. I tried playing with __array_priority__, but so far I did not manage to convince scipy.sparse matrices to defer cleanly to the right-hand multiplier (left-hand multiplier is OK). Cheers, Evgeni From evgeny.burovskiy at gmail.com Mon Mar 28 17:51:04 2016 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Mon, 28 Mar 2016 22:51:04 +0100 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: On Mon, Mar 28, 2016 at 5:37 PM, CJ Carey wrote: > Another alternative: https://github.com/perimosocordiae/sparray > > I tried to follow the numpy API as closely as I could, to see if something > like this could work. The code isn't particularly optimized, but it does > beat CSR/CSC spmatrix formats on a few benchmarks. My plan going forward is > to have swappable (opaque) backends to allow for tradeoffs in speed/memory > for access/construction/math operations. > > Apologies for the lacking documentation! The only good listing of current > capabilities at the moment is the test suite > (https://github.com/perimosocordiae/sparray/tree/master/sparray/tests). Wow, a great one from the Happy Valley! I've quite a bit of catching up to do --- you are miles ahead of me in terms of feature completeness. One difference I see is that you seem to be using a flat index. This likely does not play too well with assembly, which is what I tried to take into account. Would you be interested to join forces? Cheers, Evgeni From perimosocordiae at gmail.com Tue Mar 29 12:50:08 2016 From: perimosocordiae at gmail.com (CJ Carey) Date: Tue, 29 Mar 2016 11:50:08 -0500 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: > > > One difference I see is that you seem to be using a flat index. This > likely does not play too well with assembly, which is what I tried to > take into account. > > Yeah, I took the easy route of using a single, sorted, flat index. This makes a lot of operations easy, especially with the help of np.ravel_multi_index and friends. You're correct that it makes assembly and other modifications to the sparsity structure difficult, though! For example: https://github.com/perimosocordiae/sparray/blob/master/sparray/base.py#L335 > Would you be interested to join forces? > I've been meaning to add multiple backends to work around that problem, and I expect that merging our efforts will make that possible. Let me know how you want to proceed. -CJ -------------- next part -------------- An HTML attachment was scrubbed... URL: From travis at continuum.io Tue Mar 29 16:28:58 2016 From: travis at continuum.io (Travis Oliphant) Date: Tue, 29 Mar 2016 13:28:58 -0700 Subject: [SciPy-Dev] Blog-post that explains what Blaze actually is and where Pluribus project now lives. Message-ID: I have emailed this list in the past explaining what is driving my open source efforts now. Here is a blog-post that may help some of you understand at little bit of the history of Blaze, DyND Numba, and other related developments as they relate to scaling up and scaling out array-computing in Python. 
http://technicaldiscovery.blogspot.com/2016/03/anaconda-and-hadoop-story-of-journey.html This post and these projects do not have anything to do with the future of the NumPy and/or SciPy projects which are now in great hands guiding their community-driven development. The post is however, a discussion of additional projects that will hopefully benefit some of you as well, and for which your feedback and assistance is welcome. Best, -Travis -- *Travis Oliphant, PhD* *Co-founder and CEO* @teoliphant 512-222-5440 http://www.continuum.io -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Wed Mar 30 01:28:55 2016 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 29 Mar 2016 22:28:55 -0700 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: On Mon, Mar 28, 2016 at 3:29 AM, Evgeni Burovski wrote: > Hi, > > TL;DR: Here's a draft implementation of a 'sparse array', suitable for > incremental > assembly. Comments welcome! > https://github.com/ev-br/sparr > > Scipy sparse matrices are generally great. One area where they are not > very great > is the formats suitable for incremental assembly (esp LIL and DOK > matrices). This > has been discussed for a while, see, for instance, > https://github.com/scipy/scipy/pull/2908#discussion_r16955546 > > I spent some time to see how hard is it to build a dictionary-of-keys > data structure without Python overhead. Turns out, not very hard. > > Also, when constructing a sparse data structure now, it seems to make sense > to make it a sparse *array* rather than *matrix*, so that it supports both > elementwise and matrix multiplication. > > A draft implementation is available at https://github.com/ev-br/sparr and > I'd be grateful for any comments, large or small. > > Some usage examples are in the README file which Github shows at the > front page above. > > First and foremost, I'd like to gauge interest in the community ;-). > Does it actually make sense? Would you use such a data structure? What is > missing in the current version? This is really cool! The continuing importance of scipy.sparse is basically the only reason np.matrix hasn't been deprecated yet. I've noodled around a little with designs for similar things at various points, so a few general comments based on that that you can take or leave :-) - For general operations on n-dimensional sparse arrays, COO format seems like the obvious thing to use. The strategy I've envisioned is storing the data as N arrays of indices + 1 array of values (or possibly you'd want to pack these together into a single array of (struct containing N+1 values)), plus a bit of metadata indicating the current sort order -- so sort_order=(0, 1) would mean that the items are sorted lexicographically by row and then by column (so like CSC), and sort_order=(1, 0) would mean the opposite. Then operations like sum(axis=1) would be "sort by axis 1 (if not already there); then sum over each contiguous chunk of the data"; sum(axis=(0, 2)) would be "sort to make sure axes 0 and 2 are in the front (but I don't care in which order)", etc. It may be useful to use a stable sort, so that if you have sort_order=(0, 1, 2) and then sort by axis 2, you get sort_order=(2, 0, 1). For binary operations, you ensure that the two arrays have the same sort order (possibly re-sorting the smaller one), and then do a parallel iteration over both of them finding matching indices. 
- Instead of proliferating data types for all kinds of different storage formats, I'd really prefer to see a single data type that "just works". So for example, instead of having one storage format that's optimized for building matrices and one that's optimized for operations on matrices, you could have a single object which has: - COO format representation - a DOK of "pending updates" and writes would go into the DOK part, and at the beginning of each non-trivial operation you'd call self._get_coo_arrays() or whatever, which would do if self._pending_updates: # do a single batch reallocation and incorporate all the pending updates return self._real_coo_arrays There are a lot of potentially viable variants on this using different data structures for the various pieces, different detailed reallocation strategies, etc., but you get the idea. For interface with external C/C++/Fortran code that wants CSR or CSC, I'd just have get_csr() and get_csc() methods that return a tuple of (indptr, indices, data), and not even try to implement any actual operations on this representation. > * Data types and casting rules. For now, I basically piggy-back on > numpy's rules. > There are several slightly different ones (numba has one?), and there might be > an opportunity to simplify the rules. OTOH, inventing one more subtly different > set of rules might be a bad idea. It would be really nice if you could just use numpy's own dtype objects and ufuncs instead of implementing your own. I don't know how viable that is right now, but I think it's worth trying, and if you can't then I'd like to know what part broke :-). Happy to help if you need some tips / further explanation. -n -- Nathaniel J. Smith -- https://vorpus.org From cgodshall at enthought.com Tue Mar 29 11:13:34 2016 From: cgodshall at enthought.com (Courtenay Godshall (Enthought)) Date: Tue, 29 Mar 2016 15:13:34 -0000 Subject: [SciPy-Dev] ANN: LAST CALL for SciPy (Scientific Python) 2016 Conference Talk / Poster Proposals - Due Friday 4/1 Message-ID: <008801d189cd$9108d070$b31a7150$@enthought.com> **ANN: LAST CALL for SciPy 2016 Conference Talk / Poster Proposals (Scientific Computing with Python) - Final Deadline - Friday April 1st** SciPy 2016, the 15th annual Scientific Computing with Python conference, will be held July 11-17, 2016 in Austin, Texas. SciPy is a community dedicated to the advancement of scientific computing through open source Python software for mathematics, science, and engineering. The annual SciPy Conference brings together over 650 participants from industry, academia, and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development. The full program will consist of 2 days of tutorials (July 11-12), 3 days of talks (July 13-15), and 2 days of developer sprints (July 16-17). More info is available on the conference website at http://scipy2016.scipy.org (where you can sign up for the mailing list); or follow @scipyconf on Twitter. ---------------------------------------------------------------------------- --------------------------------- **SUBMIT A SCIPY 2016 TALK / POSTER PROPOSAL - DUE APRIL 1, 2016* ---------------------------------------------------------------------------- --------------------------------- Submissions for talks and posters are welcome on our website ( http://scipy2016.scipy.org). In your abstract, please provide details on what Python tools are being employed, and how. 
The talk and poster submission deadline is March 25th, 2016 SciPy 2016 will include 3 major topic tracks and 8 mini-symposia tracks. Major topic tracks include: - Scientific Computing in Python - Python in Data Science (Big data and not so big data) - High Performance Computing Mini-symposia will include the applications of Python in: - Earth and Space Science - Engineering - Medicine and Biology - Social Sciences - Special Purpose Databases - Case Studies in Industry - Education - Reproducibility If you have any questions or comments, feel free to contact us at: scipy-organizers at scipy.org ----------------------------------------------------------- **SCIPY 2016 REGISTRATION IS OPEN** ----------------------------------------------------------- Please register early. SciPy early bird registration until May 22, 2016. Register at http://scipy2016.scipy.org. Plus, enter our t-shirt design contest to win a free registration. (Send a vector art file to scipy at enthought.com by March 31 to enter). Important dates: April 1: Talk and Poster Proposals Due May 11: Plotting Contest Submissions Due Apr 22: Tutorials Announced Apr 22: Financial Aid Submissions Due May 4: Talk and Posters Announced May 22: Early Bird Registration Deadline Jul 11-12: SciPy 2016 Tutorials Jul 13-15: SciPy 2016 General Conference Jul 16-17: SciPy 2016 Sprints We look forward to an exciting conference and hope to see you in Austin in July! The Scipy 2016 Committee http://scipy2016.scipy.org/ -------------- next part -------------- An HTML attachment was scrubbed... URL: