From gfyoung17 at gmail.com Thu Mar 3 05:01:30 2016 From: gfyoung17 at gmail.com (G Young) Date: Thu, 3 Mar 2016 10:01:30 +0000 Subject: [SciPy-Dev] "invalid" matrices in signaltools.py Message-ID: Hello all, I have a PR up here that allows functions like "convolve" and "correlate" to accept matrices in either order for the "valid" order. People seem to be in general agreement that either order should be accepted. The question is how to handle inputs where one matrix is not strictly larger than the other (i.e. the dimensions of one matrix are not all at least as large as the analogous dimension of another). My proposal is to let such arguments pass through, as the function will always return the empty matrix as expected. However, others are more in favor of throwing an Exception. What are people's thoughts? Thanks! Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Thu Mar 3 05:26:57 2016 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Thu, 03 Mar 2016 11:26:57 +0100 Subject: [SciPy-Dev] "invalid" matrices in signaltools.py In-Reply-To: References: Message-ID: <1457000817.26057.9.camel@sipsolutions.net> On Do, 2016-03-03 at 10:01 +0000, G Young wrote: > Hello all, > > I have a PR up here that allows functions like "convolve" and > "correlate" to accept matrices in either order for the "valid" order. > People seem to be in general agreement that either order should be > accepted. The question is how to handle inputs where one matrix is > not strictly larger than the other (i.e. the dimensions of one matrix > are not all at least as large as the analogous dimension of another). > > My proposal is to let such arguments pass through, as the function > will always return the empty matrix as expected. However, others are > more in favor of throwing an Exception. What are people's thoughts? > Not really my thing, but my gut feeling would be an exception as well; unless someone can think of an example where it might be useful (i.e. for empty arrays you can often construct examples by thinking of data stream chunks or groups which happen to have 0 items in them). The empty array in general does seem to loose information though. After all, you do not have an empty array of shape (3, -2), but of shape (3, 0), and even then since "valid" mode switches which array is the larger one, should it be (3, 0) or maybe (0, 2)? - Sebastian > Thanks! > > Greg > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: This is a digitally signed message part URL: From evgeny.burovskiy at gmail.com Thu Mar 3 11:59:23 2016 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Thu, 3 Mar 2016 16:59:23 +0000 Subject: [SciPy-Dev] Fwd: [Numpy-discussion] GSoC? In-Reply-To: References: Message-ID: ---------- Forwarded message ---------- From: Ralf Gommers Date: Wed, Feb 10, 2016 at 11:02 PM Subject: Re: [Numpy-discussion] GSoC? To: Discussion of Numerical Python On Wed, Feb 10, 2016 at 11:55 PM, Ralf Gommers wrote: > > > > On Wed, Feb 10, 2016 at 11:48 PM, Chris Barker wrote: >> >> Thanks Ralf, >> >>> Note that we have always done a combined numpy/scipy ideas page and submission. For really good students numpy may be the right challenge, but in general scipy is easier to get started on. 
>> >> >> yup -- good idea. Is there a page ready to go, or do we need to get one up? (I don't even know where to put it...) > > > This is last year's page: https://github.com/scipy/scipy/wiki/GSoC-2015-project-ideas > > Some ideas have been worked on, others are still relevant. Let's copy this page to -2016- and start editing it and adding new ideas. I'll start right now actually. OK first version: https://github.com/scipy/scipy/wiki/GSoC-2016-project-ideas I kept some of the ideas from last year, but removed all potential mentors as the same people may not be available this year - please re-add yourselves where needed. And to everyone who has a good idea, and preferably is willing to mentor for that idea: please add it to that page. Here are a couple of vague ideas which I'd like to hash out on the list first. Both are mostly in the request-for-comment, feedback-welcome stage. (I am going to write them out even though I cannot guarantee that I'll be available to mentor either of them.) 1. Numeric differentiation. The current proposal focuses on bundling numdifftools. I think there's room for a complementary set of routines for numeric differentiation with fixed step, or limited intelligence w.r.t. the step selection. In fact, there is a fairly comprehensive implementation in scipy already, a by-product of last year's GSoC. It's not exposed publicly, and is hidden as scipy.optimize._numdiff.approx_derivative. I suspect that what's already there covers the bulk of the implementation, and what's left is a only a bit of work to spec out the user-facing API, especially for the broadcasting behavior, and then replacing semi-broken bits of scipy.misc.derivative, scipy.optimize.approx_fprime etc. The original PR has quite extensive discussion, https://github.com/scipy/scipy/pull/4884 And here's an unfinished attempt at following up on that discussion: https://gist.github.com/ev-br/7a1edb3f250bd375c46a 2. Optimization benchmarks. There are two quite large bodies of work on benchmarking various optimization algorithms: - Andrew Nelson's global optimizers benchmarks, https://github.com/scipy/scipy/pull/4191. - Nikolay Mayorov's nonliear LSQ benchmarks, some of which are available in scipy/benchmarks, and the most recent version is at https://github.com/nmayorov/lsq-benchmarks I think both can be very valuable for a broader scipy ecosystem, and both of these herculean efforts are currently blocked on the question of how to actually run them. It seems to me that it would be a useful project to come up with a standardized way of running these sorts of benchmarks and publishing the results. Ideally, it would include, * a runner. Maybe an ASV-based one, but then the project might include some enhancements to asv itself. * a way of pushlishing the results (again, asv has one, is it sufficient?) * a self-contained environment for running and publishing (a VM image? a docker container?) So that by the end of the project we would have a web page with the reference results, some automation for refreshing when needed, and a documented way of reproducing them. Thoughts? Evgeni From matthew.brett at gmail.com Sat Mar 12 00:16:12 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 11 Mar 2016 21:16:12 -0800 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
Message-ID: Hi, I just noticed that our (neuroimaging) test suite fails for scipy 0.17.0: https://travis-ci.org/matthew-brett/nipy/jobs/115471563 This turns out to be due to small changes in the result of `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform - the travis Ubuntu in this case. Is that expected? I haven't investigated deeply, but I believe the 0.16.1 result (back into the depths of time) is slightly more accurate. Is that possible? Thanks for any insight, Matthew From ralf.gommers at gmail.com Sun Mar 13 06:43:18 2016 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sun, 13 Mar 2016 11:43:18 +0100 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett wrote: > Hi, > > I just noticed that our (neuroimaging) test suite fails for scipy 0.17.0: > > https://travis-ci.org/matthew-brett/nipy/jobs/115471563 > > This turns out to be due to small changes in the result of > `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform > - the travis Ubuntu in this case. > > Is that expected? I haven't investigated deeply, but I believe the > 0.16.1 result (back into the depths of time) is slightly more > accurate. Is that possible? > That's very well possible, because we changed the underlying LAPACK routine that's used: https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. This PR also had some lstsq precision issues, so could be useful: https://github.com/scipy/scipy/pull/5550 What is the accuracy difference in your case? Ralf > > Thanks for any insight, > > Matthew > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From x3n0cr4735 at gmail.com Sun Mar 13 16:28:28 2016 From: x3n0cr4735 at gmail.com (Jan-Samuel Wagner) Date: Sun, 13 Mar 2016 16:28:28 -0400 Subject: [SciPy-Dev] Gaussian Process hyperparameter tuning Message-ID: Is anyone busy developing a Gaussian Process strategy for hype-parameter tuning? Seems strange there is only GridSearchCV and RandomSearchCV methods for sklearn. Please let me know. I would love to develop this feature. -------------- next part -------------- An HTML attachment was scrubbed... URL: From davidmenhur at gmail.com Sun Mar 13 17:45:25 2016 From: davidmenhur at gmail.com (=?UTF-8?B?RGHPgGlk?=) Date: Sun, 13 Mar 2016 22:45:25 +0100 Subject: [SciPy-Dev] Gaussian Process hyperparameter tuning In-Reply-To: References: Message-ID: On 13 March 2016 at 21:28, Jan-Samuel Wagner wrote: > Is anyone busy developing a Gaussian Process strategy for hype-parameter > tuning? Seems strange there is only GridSearchCV and RandomSearchCV methods > for sklearn. The GPy package is pretty feature rich, and more complete than what is in sklearn. https://gpy.readthedocs.org/en/master/ They have implemented a gradient descent and a minibatch, and also many different kernels, sparse GP, etc. 
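For a sense of what that looks like in practice, here is a minimal sketch of fitting a GP and optimizing its kernel hyperparameters with GPy (the toy data and RBF kernel are only illustrative, and the calls assume a recent GPy release):

import numpy as np
import GPy

# Toy 1-D regression problem (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(-3., 3., (20, 1))
y = np.sin(X) + 0.05 * rng.randn(20, 1)

# RBF kernel: variance and lengthscale are the hyperparameters to tune.
kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
model = GPy.models.GPRegression(X, y, kernel)

# Gradient-based maximization of the log marginal likelihood.
model.optimize(messages=False)
print(model)  # optimized variance, lengthscale and Gaussian noise

mean, var = model.predict(np.linspace(-3., 3., 50)[:, None])

An optimized GP surrogate like this is the building block that a Bayesian-optimization style tuner would place on top of an estimator's cross-validation score.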
From gael.varoquaux at normalesup.org Sun Mar 13 18:37:54 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sun, 13 Mar 2016 23:37:54 +0100 Subject: [SciPy-Dev] Gaussian Process hyperparameter tuning In-Reply-To: References: Message-ID: <20160313223754.GC474178@phare.normalesup.org> On Sun, Mar 13, 2016 at 04:28:28PM -0400, Jan-Samuel Wagner wrote: > Is anyone busy developing a Gaussian Process strategy for hype-parameter > tuning? Seems strange there is only GridSearchCV and RandomSearchCV methods for > sklearn. Such question is probably better asked on the scikit-learn mailing list: https://lists.sourceforge.net/lists/listinfo/scikit-learn-general This feature is being worked upon: https://github.com/scikit-learn/scikit-learn/pull/5491 Cheers, Ga?l From matthew.brett at gmail.com Wed Mar 16 20:05:03 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 16 Mar 2016 17:05:03 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: Hi, On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers wrote: > > > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > wrote: >> >> Hi, >> >> I just noticed that our (neuroimaging) test suite fails for scipy 0.17.0: >> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >> >> This turns out to be due to small changes in the result of >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform >> - the travis Ubuntu in this case. >> >> Is that expected? I haven't investigated deeply, but I believe the >> 0.16.1 result (back into the depths of time) is slightly more >> accurate. Is that possible? > > > That's very well possible, because we changed the underlying LAPACK routine > that's used: > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. > This PR also had some lstsq precision issues, so could be useful: > https://github.com/scipy/scipy/pull/5550 > > What is the accuracy difference in your case? Damn - I thought you might ask me that! The test where it showed up comes from some calculations on a simple regression problem. The regression model is $y = X B + e$, with ordinary least squares fit given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 pinv, compared to scipy 0.17 pinv: # SSE using high-precision sympy pinv 72.0232781366615313 # SSE using scipy 0.16 pinv 72.0232781366625261 # SSE using scipy 0.17 pinv 72.0232781390480170 See: https://github.com/matthew-brett/scipy-pinv/blob/master/pinv_futz.py A difference at the 9th decimal place - but this was enough to give a difference at the 5th decimal place in a more involved calculation using the same pinv result: https://github.com/nipy/nipy/pull/394/files Does that help? Cheers, Matthew From charlesr.harris at gmail.com Wed Mar 16 20:41:16 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 16 Mar 2016 18:41:16 -0600 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett wrote: > Hi, > > On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers > wrote: > > > > > > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > > wrote: > >> > >> Hi, > >> > >> I just noticed that our (neuroimaging) test suite fails for scipy > 0.17.0: > >> > >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 > >> > >> This turns out to be due to small changes in the result of > >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform > >> - the travis Ubuntu in this case. > >> > >> Is that expected? I haven't investigated deeply, but I believe the > >> 0.16.1 result (back into the depths of time) is slightly more > >> accurate. Is that possible? > > > > > > That's very well possible, because we changed the underlying LAPACK > routine > > that's used: > > > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements > . > > This PR also had some lstsq precision issues, so could be useful: > > https://github.com/scipy/scipy/pull/5550 > > > > What is the accuracy difference in your case? > > Damn - I thought you might ask me that! > > The test where it showed up comes from some calculations on a simple > regression problem. > > The regression model is $y = X B + e$, with ordinary least squares fit > given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. > > For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 > pinv, compared to scipy 0.17 pinv: > > # SSE using high-precision sympy pinv > 72.0232781366615313 > # SSE using scipy 0.16 pinv > 72.0232781366625261 > # SSE using scipy 0.17 pinv > 72.0232781390480170 > That difference looks pretty big, looks almost like a cutoff effect. What is the matrix X, in particular, what is its condition number or singular values? Might also be interesting to see residuals for the different functions. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Mar 16 20:48:26 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 16 Mar 2016 18:48:26 -0600 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris wrote: > > > On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett > wrote: > >> Hi, >> >> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >> wrote: >> > >> > >> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > > >> > wrote: >> >> >> >> Hi, >> >> >> >> I just noticed that our (neuroimaging) test suite fails for scipy >> 0.17.0: >> >> >> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >> >> >> >> This turns out to be due to small changes in the result of >> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform >> >> - the travis Ubuntu in this case. >> >> >> >> Is that expected? I haven't investigated deeply, but I believe the >> >> 0.16.1 result (back into the depths of time) is slightly more >> >> accurate. Is that possible? >> > >> > >> > That's very well possible, because we changed the underlying LAPACK >> routine >> > that's used: >> > >> https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements >> . >> > This PR also had some lstsq precision issues, so could be useful: >> > https://github.com/scipy/scipy/pull/5550 >> > >> > What is the accuracy difference in your case? >> >> Damn - I thought you might ask me that! 
>> >> The test where it showed up comes from some calculations on a simple >> regression problem. >> >> The regression model is $y = X B + e$, with ordinary least squares fit >> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >> >> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >> pinv, compared to scipy 0.17 pinv: >> >> # SSE using high-precision sympy pinv >> 72.0232781366615313 >> # SSE using scipy 0.16 pinv >> 72.0232781366625261 >> # SSE using scipy 0.17 pinv >> 72.0232781390480170 >> > > That difference looks pretty big, looks almost like a cutoff effect. What > is the matrix X, in particular, what is its condition number or singular > values? Might also be interesting to see residuals for the different > functions. > > > > Note that this might be an indication that the result of the calculation has an intrinsically large error bar. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Wed Mar 16 21:16:58 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 16 Mar 2016 18:16:58 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: Hi, On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris wrote: > > > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris > wrote: >> >> >> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >> wrote: >>> >>> Hi, >>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >>> wrote: >>> > >>> > >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>> > >>> > wrote: >>> >> >>> >> Hi, >>> >> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy >>> >> 0.17.0: >>> >> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>> >> >>> >> This turns out to be due to small changes in the result of >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform >>> >> - the travis Ubuntu in this case. >>> >> >>> >> Is that expected? I haven't investigated deeply, but I believe the >>> >> 0.16.1 result (back into the depths of time) is slightly more >>> >> accurate. Is that possible? >>> > >>> > >>> > That's very well possible, because we changed the underlying LAPACK >>> > routine >>> > that's used: >>> > >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. >>> > This PR also had some lstsq precision issues, so could be useful: >>> > https://github.com/scipy/scipy/pull/5550 >>> > >>> > What is the accuracy difference in your case? >>> >>> Damn - I thought you might ask me that! >>> >>> The test where it showed up comes from some calculations on a simple >>> regression problem. >>> >>> The regression model is $y = X B + e$, with ordinary least squares fit >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >>> pinv, compared to scipy 0.17 pinv: >>> >>> # SSE using high-precision sympy pinv >>> 72.0232781366615313 >>> # SSE using scipy 0.16 pinv >>> 72.0232781366625261 >>> # SSE using scipy 0.17 pinv >>> 72.0232781390480170 >> >> >> That difference looks pretty big, looks almost like a cutoff effect. What >> is the matrix X, in particular, what is its condition number or singular >> values? Might also be interesting to see residuals for the different >> functions. >> >> >> > > Note that this might be an indication that the result of the calculation has > an intrinsically large error bar. 
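For reference, the quantity being compared above is the residual sum of squares of the pinv-based fit B = X^+ y. The sse and print_scalar helpers live in the linked gist; a plausible reconstruction of what they compute (illustrative only, not necessarily the gist's exact code) is:

import numpy as np
import scipy.linalg as spl

def sse(X, piX, y):
    # OLS fit via a supplied pseudoinverse: B = X^+ y, residuals e = y - X B.
    B = piX.dot(y)
    e = y - X.dot(B)
    return e.dot(e)

def print_scalar(v):
    # Enough digits to expose differences at the 9th to 16th decimal place.
    print('%.16f' % v)

# e.g. print_scalar(sse(X, spl.pinv(X), y)), with X and y taken from the gist.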
In [6]: np.linalg.svd(X, compute_uv=False) Out[6]: array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, 1.72675581e-03, 9.51576795e-04]) In [7]: np.linalg.cond(X) Out[7]: 73278544935.300507 I wasn't sure what you meant about residuals from different functions - can you explain more? Cheers, Matthew From matthew.brett at gmail.com Wed Mar 16 22:13:18 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 16 Mar 2016 19:13:18 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 6:16 PM, Matthew Brett wrote: > Hi, > > On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris > wrote: >> >> >> On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >> wrote: >>> >>> >>> >>> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >>> wrote: >>>> >>>> Hi, >>>> >>>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >>>> wrote: >>>> > >>>> > >>>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>>> > >>>> > wrote: >>>> >> >>>> >> Hi, >>>> >> >>>> >> I just noticed that our (neuroimaging) test suite fails for scipy >>>> >> 0.17.0: >>>> >> >>>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>>> >> >>>> >> This turns out to be due to small changes in the result of >>>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same platform >>>> >> - the travis Ubuntu in this case. >>>> >> >>>> >> Is that expected? I haven't investigated deeply, but I believe the >>>> >> 0.16.1 result (back into the depths of time) is slightly more >>>> >> accurate. Is that possible? >>>> > >>>> > >>>> > That's very well possible, because we changed the underlying LAPACK >>>> > routine >>>> > that's used: >>>> > >>>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. >>>> > This PR also had some lstsq precision issues, so could be useful: >>>> > https://github.com/scipy/scipy/pull/5550 >>>> > >>>> > What is the accuracy difference in your case? >>>> >>>> Damn - I thought you might ask me that! >>>> >>>> The test where it showed up comes from some calculations on a simple >>>> regression problem. >>>> >>>> The regression model is $y = X B + e$, with ordinary least squares fit >>>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >>>> >>>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >>>> pinv, compared to scipy 0.17 pinv: >>>> >>>> # SSE using high-precision sympy pinv >>>> 72.0232781366615313 >>>> # SSE using scipy 0.16 pinv >>>> 72.0232781366625261 >>>> # SSE using scipy 0.17 pinv >>>> 72.0232781390480170 >>> >>> >>> That difference looks pretty big, looks almost like a cutoff effect. What >>> is the matrix X, in particular, what is its condition number or singular >>> values? Might also be interesting to see residuals for the different >>> functions. >>> >>> >>> >> >> Note that this might be an indication that the result of the calculation has >> an intrinsically large error bar. 
> > In [6]: np.linalg.svd(X, compute_uv=False) > Out[6]: > array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, > 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, > 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, > 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, > 1.72675581e-03, 9.51576795e-04]) > > In [7]: np.linalg.cond(X) > Out[7]: 73278544935.300507 Although, the differences in residual error are at the 12th decimal place comparing the sum of squared error for a full-rank vs not-full-rank matrix: In [2]: import scipy.linalg as spl In [3]: X_redundant = np.c_[X, X[:, 0]] In [5]: np.linalg.cond(X_redundant) Out[5]: 4.6338774550388212e+23 In [8]: print_scalar(sse(X, spl.pinv(X), y)) 72.0232781366634640 In [9]: print_scalar(sse(X_redundant, spl.pinv(X_redundant), y)) 72.0232781366614461 Cheers, Matthew From charlesr.harris at gmail.com Wed Mar 16 23:01:35 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 16 Mar 2016 21:01:35 -0600 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett wrote: > Hi, > > On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris > wrote: > > > > > > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris > > wrote: > >> > >> > >> > >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett > > >> wrote: > >>> > >>> Hi, > >>> > >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers > >>> wrote: > >>> > > >>> > > >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > >>> > > >>> > wrote: > >>> >> > >>> >> Hi, > >>> >> > >>> >> I just noticed that our (neuroimaging) test suite fails for scipy > >>> >> 0.17.0: > >>> >> > >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 > >>> >> > >>> >> This turns out to be due to small changes in the result of > >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same > platform > >>> >> - the travis Ubuntu in this case. > >>> >> > >>> >> Is that expected? I haven't investigated deeply, but I believe the > >>> >> 0.16.1 result (back into the depths of time) is slightly more > >>> >> accurate. Is that possible? > >>> > > >>> > > >>> > That's very well possible, because we changed the underlying LAPACK > >>> > routine > >>> > that's used: > >>> > > >>> > > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements > . > >>> > This PR also had some lstsq precision issues, so could be useful: > >>> > https://github.com/scipy/scipy/pull/5550 > >>> > > >>> > What is the accuracy difference in your case? > >>> > >>> Damn - I thought you might ask me that! > >>> > >>> The test where it showed up comes from some calculations on a simple > >>> regression problem. > >>> > >>> The regression model is $y = X B + e$, with ordinary least squares fit > >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. > >>> > >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 > >>> pinv, compared to scipy 0.17 pinv: > >>> > >>> # SSE using high-precision sympy pinv > >>> 72.0232781366615313 > >>> # SSE using scipy 0.16 pinv > >>> 72.0232781366625261 > >>> # SSE using scipy 0.17 pinv > >>> 72.0232781390480170 > >> > >> > >> That difference looks pretty big, looks almost like a cutoff effect. > What > >> is the matrix X, in particular, what is its condition number or singular > >> values? Might also be interesting to see residuals for the different > >> functions. 
> >> > >> > >> > > > > Note that this might be an indication that the result of the calculation > has > > an intrinsically large error bar. > > In [6]: np.linalg.svd(X, compute_uv=False) > Out[6]: > array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, > 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, > 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, > 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, > 1.72675581e-03, 9.51576795e-04]) > > In [7]: np.linalg.cond(X) > Out[7]: 73278544935.300507 > Condition number is about 7e10, so you might expect to lose about 10 digits of of accuracy just due to roundoff. The covariance of the fitting error will be `inv(X.T, X) * error`, so you can look at the diagonals for the variance of the fitted parameters just for kicks, taking `error` to be roundoff error or measurement error. This is not to say that the results should not be more reproducible, but should be of interest for the validity of the solution. Note that I suspect that there may be a difference in the default `rcond` used, or maybe the matrix is on the edge and a small change in results changes the effective rank. You can run with the `return_rank=True` argument to check that. Looks like machine precision is used for the default value of rcond, so experimenting with that probable won't be informative, but you might want to try it anyway. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Wed Mar 16 23:39:29 2016 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Wed, 16 Mar 2016 23:39:29 -0400 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris < charlesr.harris at gmail.com> wrote: > > > On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett > wrote: > >> Hi, >> >> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >> wrote: >> > >> > >> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >> > wrote: >> >> >> >> >> >> >> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett < >> matthew.brett at gmail.com> >> >> wrote: >> >>> >> >>> Hi, >> >>> >> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers > > >> >>> wrote: >> >>> > >> >>> > >> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >> >>> > >> >>> > wrote: >> >>> >> >> >>> >> Hi, >> >>> >> >> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy >> >>> >> 0.17.0: >> >>> >> >> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >> >>> >> >> >>> >> This turns out to be due to small changes in the result of >> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same >> platform >> >>> >> - the travis Ubuntu in this case. >> >>> >> >> >>> >> Is that expected? I haven't investigated deeply, but I believe >> the >> >>> >> 0.16.1 result (back into the depths of time) is slightly more >> >>> >> accurate. Is that possible? >> >>> > >> >>> > >> >>> > That's very well possible, because we changed the underlying LAPACK >> >>> > routine >> >>> > that's used: >> >>> > >> >>> > >> https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements >> . >> >>> > This PR also had some lstsq precision issues, so could be useful: >> >>> > https://github.com/scipy/scipy/pull/5550 >> >>> > >> >>> > What is the accuracy difference in your case? >> >>> >> >>> Damn - I thought you might ask me that! >> >>> >> >>> The test where it showed up comes from some calculations on a simple >> >>> regression problem. 
>> >>> >> >>> The regression model is $y = X B + e$, with ordinary least squares fit >> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >> >>> >> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >> >>> pinv, compared to scipy 0.17 pinv: >> >>> >> >>> # SSE using high-precision sympy pinv >> >>> 72.0232781366615313 >> >>> # SSE using scipy 0.16 pinv >> >>> 72.0232781366625261 >> >>> # SSE using scipy 0.17 pinv >> >>> 72.0232781390480170 >> >> >> >> >> >> That difference looks pretty big, looks almost like a cutoff effect. >> What >> >> is the matrix X, in particular, what is its condition number or >> singular >> >> values? Might also be interesting to see residuals for the different >> >> functions. >> >> >> >> >> >> >> > >> > Note that this might be an indication that the result of the >> calculation has >> > an intrinsically large error bar. >> >> In [6]: np.linalg.svd(X, compute_uv=False) >> Out[6]: >> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >> 1.72675581e-03, 9.51576795e-04]) >> >> In [7]: np.linalg.cond(X) >> Out[7]: 73278544935.300507 >> > > Condition number is about 7e10, so you might expect to lose about 10 > digits of of accuracy just due to roundoff. The covariance of the fitting > error will be `inv(X.T, X) * error`, so you can look at the diagonals for > the variance of the fitted parameters just for kicks, taking `error` to be > roundoff error or measurement error. This is not to say that the results > should not be more reproducible, but should be of interest for the validity > of the solution. Note that I suspect that there may be a difference in the > default `rcond` used, or maybe the matrix is on the edge and a small change > in results changes the effective rank. You can run with the > `return_rank=True` argument to check that. Looks like machine precision is > used for the default value of rcond, so experimenting with that probable > won't be informative, but you might want to try it anyway. > statsmodels is using numpy pinv for the linear model and it works in almost all cases very well *). Using pinv with singular (given the rcond threshold) X makes everything well defined and not subject to numerical noise because of the regularization induced by pinv. Our covariance for the parameter estimates, based on pinv(X), is also not noisy but it's of reduced rank. The main problem I have seen so far is that the default rcond is too small for some cases, and it breaks down with near singular X with rcond a bit larger than 1e-15 in an example with dummy variables. https://github.com/statsmodels/statsmodels/issues/2628 http://stackoverflow.com/questions/32599159/robustness-issue-of-statsmodel-linear-regression-ols-python If I think a bit longer, then I remember two other cases where pinv with ill conditioned X also doesn't work very well with continuous almost collinear data, one NIST example with badly scaled polynomials where precision goes down to a few decimals and a very small economic dataset where the normal equations (or orthogonality conditions?) are not satisfied at a good precision.. Josef > > Chuck > > > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > > -------------- next part -------------- An HTML attachment was scrubbed... 
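To make the rcond point concrete, a small sketch (the near-collinear design matrix here is made up for illustration; it is not the matrix from this thread):

import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(50)
# Intercept plus two nearly collinear regressors.
X = np.column_stack([np.ones(50), x, x + 1e-9 * rng.randn(50)])
print(np.linalg.cond(X))  # very large condition number

# The default rcond (~1e-15) only drops singular values below 1e-15 * s_max,
# so the tiny, noise-dominated singular value is still inverted.
pinv_default = np.linalg.pinv(X)

# A larger rcond truncates that singular value, i.e. pinv regularizes the fit.
pinv_truncated = np.linalg.pinv(X, rcond=1e-8)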
URL: From charlesr.harris at gmail.com Wed Mar 16 23:49:29 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 16 Mar 2016 21:49:29 -0600 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 9:39 PM, wrote: > > > On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >> wrote: >> >>> Hi, >>> >>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >>> wrote: >>> > >>> > >>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >>> > wrote: >>> >> >>> >> >>> >> >>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett < >>> matthew.brett at gmail.com> >>> >> wrote: >>> >>> >>> >>> Hi, >>> >>> >>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers < >>> ralf.gommers at gmail.com> >>> >>> wrote: >>> >>> > >>> >>> > >>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>> >>> > >>> >>> > wrote: >>> >>> >> >>> >>> >> Hi, >>> >>> >> >>> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy >>> >>> >> 0.17.0: >>> >>> >> >>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>> >>> >> >>> >>> >> This turns out to be due to small changes in the result of >>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same >>> platform >>> >>> >> - the travis Ubuntu in this case. >>> >>> >> >>> >>> >> Is that expected? I haven't investigated deeply, but I believe >>> the >>> >>> >> 0.16.1 result (back into the depths of time) is slightly more >>> >>> >> accurate. Is that possible? >>> >>> > >>> >>> > >>> >>> > That's very well possible, because we changed the underlying LAPACK >>> >>> > routine >>> >>> > that's used: >>> >>> > >>> >>> > >>> https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements >>> . >>> >>> > This PR also had some lstsq precision issues, so could be useful: >>> >>> > https://github.com/scipy/scipy/pull/5550 >>> >>> > >>> >>> > What is the accuracy difference in your case? >>> >>> >>> >>> Damn - I thought you might ask me that! >>> >>> >>> >>> The test where it showed up comes from some calculations on a simple >>> >>> regression problem. >>> >>> >>> >>> The regression model is $y = X B + e$, with ordinary least squares >>> fit >>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >>> >>> >>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >>> >>> pinv, compared to scipy 0.17 pinv: >>> >>> >>> >>> # SSE using high-precision sympy pinv >>> >>> 72.0232781366615313 >>> >>> # SSE using scipy 0.16 pinv >>> >>> 72.0232781366625261 >>> >>> # SSE using scipy 0.17 pinv >>> >>> 72.0232781390480170 >>> >> >>> >> >>> >> That difference looks pretty big, looks almost like a cutoff effect. >>> What >>> >> is the matrix X, in particular, what is its condition number or >>> singular >>> >> values? Might also be interesting to see residuals for the different >>> >> functions. >>> >> >>> >> >>> >> >>> > >>> > Note that this might be an indication that the result of the >>> calculation has >>> > an intrinsically large error bar. 
>>> >>> In [6]: np.linalg.svd(X, compute_uv=False) >>> Out[6]: >>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >>> 1.72675581e-03, 9.51576795e-04]) >>> >>> In [7]: np.linalg.cond(X) >>> Out[7]: 73278544935.300507 >>> >> >> Condition number is about 7e10, so you might expect to lose about 10 >> digits of of accuracy just due to roundoff. The covariance of the fitting >> error will be `inv(X.T, X) * error`, so you can look at the diagonals for >> the variance of the fitted parameters just for kicks, taking `error` to be >> roundoff error or measurement error. This is not to say that the results >> should not be more reproducible, but should be of interest for the validity >> of the solution. Note that I suspect that there may be a difference in the >> default `rcond` used, or maybe the matrix is on the edge and a small change >> in results changes the effective rank. You can run with the >> `return_rank=True` argument to check that. Looks like machine precision is >> used for the default value of rcond, so experimenting with that probable >> won't be informative, but you might want to try it anyway. >> > > > statsmodels is using numpy pinv for the linear model and it works in > almost all cases very well *). > > Using pinv with singular (given the rcond threshold) X makes everything > well defined and not subject to numerical noise because of the > regularization induced by pinv. Our covariance for the parameter > estimates, based on pinv(X), is also not noisy but it's of reduced rank. > What I'm thinking is that changing the algorithm, as here, is like introducing noise relative to the original, so even if the result is perfectly valid, the fitted parameters might change significantly. A change in the 11-th digit wouldn't surprise me at all, depending on the actual X and rhs. > > The main problem I have seen so far is that the default rcond is too small > for some cases, and it breaks down with near singular X with rcond a bit > larger than 1e-15 in an example with dummy variables. > > https://github.com/statsmodels/statsmodels/issues/2628 > > http://stackoverflow.com/questions/32599159/robustness-issue-of-statsmodel-linear-regression-ols-python > > > If I think a bit longer, then I remember two other cases where pinv with > ill conditioned X also doesn't work very well with continuous almost > collinear data, one NIST example with badly scaled polynomials where > precision goes down to a few decimals and a very small economic dataset > where the normal equations (or orthogonality conditions?) are not satisfied > at a good precision.. > > Yeah, it can make a real mess with polynomial fits ;) Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Mar 16 23:57:22 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 16 Mar 2016 21:57:22 -0600 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 9:49 PM, Charles R Harris wrote: > > > On Wed, Mar 16, 2016 at 9:39 PM, wrote: > >> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >>> >>> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >>> wrote: >>> >>>> Hi, >>>> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >>>> wrote: >>>> > >>>> > >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >>>> > wrote: >>>> >> >>>> >> >>>> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett < >>>> matthew.brett at gmail.com> >>>> >> wrote: >>>> >>> >>>> >>> Hi, >>>> >>> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers < >>>> ralf.gommers at gmail.com> >>>> >>> wrote: >>>> >>> > >>>> >>> > >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>>> >>> > >>>> >>> > wrote: >>>> >>> >> >>>> >>> >> Hi, >>>> >>> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy >>>> >>> >> 0.17.0: >>>> >>> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>>> >>> >> >>>> >>> >> This turns out to be due to small changes in the result of >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same >>>> platform >>>> >>> >> - the travis Ubuntu in this case. >>>> >>> >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I believe >>>> the >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly more >>>> >>> >> accurate. Is that possible? >>>> >>> > >>>> >>> > >>>> >>> > That's very well possible, because we changed the underlying >>>> LAPACK >>>> >>> > routine >>>> >>> > that's used: >>>> >>> > >>>> >>> > >>>> https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements >>>> . >>>> >>> > This PR also had some lstsq precision issues, so could be useful: >>>> >>> > https://github.com/scipy/scipy/pull/5550 >>>> >>> > >>>> >>> > What is the accuracy difference in your case? >>>> >>> >>>> >>> Damn - I thought you might ask me that! >>>> >>> >>>> >>> The test where it showed up comes from some calculations on a simple >>>> >>> regression problem. >>>> >>> >>>> >>> The regression model is $y = X B + e$, with ordinary least squares >>>> fit >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >>>> >>> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >>>> >>> pinv, compared to scipy 0.17 pinv: >>>> >>> >>>> >>> # SSE using high-precision sympy pinv >>>> >>> 72.0232781366615313 >>>> >>> # SSE using scipy 0.16 pinv >>>> >>> 72.0232781366625261 >>>> >>> # SSE using scipy 0.17 pinv >>>> >>> 72.0232781390480170 >>>> >> >>>> >> >>>> >> That difference looks pretty big, looks almost like a cutoff effect. >>>> What >>>> >> is the matrix X, in particular, what is its condition number or >>>> singular >>>> >> values? Might also be interesting to see residuals for the different >>>> >> functions. >>>> >> >>>> >> >>>> >> >>>> > >>>> > Note that this might be an indication that the result of the >>>> calculation has >>>> > an intrinsically large error bar. 
>>>> >>>> In [6]: np.linalg.svd(X, compute_uv=False) >>>> Out[6]: >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >>>> 1.72675581e-03, 9.51576795e-04]) >>>> >>>> In [7]: np.linalg.cond(X) >>>> Out[7]: 73278544935.300507 >>>> >>> >>> Condition number is about 7e10, so you might expect to lose about 10 >>> digits of of accuracy just due to roundoff. The covariance of the fitting >>> error will be `inv(X.T, X) * error`, so you can look at the diagonals for >>> the variance of the fitted parameters just for kicks, taking `error` to be >>> roundoff error or measurement error. This is not to say that the results >>> should not be more reproducible, but should be of interest for the validity >>> of the solution. Note that I suspect that there may be a difference in the >>> default `rcond` used, or maybe the matrix is on the edge and a small change >>> in results changes the effective rank. You can run with the >>> `return_rank=True` argument to check that. Looks like machine precision is >>> used for the default value of rcond, so experimenting with that probable >>> won't be informative, but you might want to try it anyway. >>> >> >> >> statsmodels is using numpy pinv for the linear model and it works in >> almost all cases very well *). >> >> Using pinv with singular (given the rcond threshold) X makes everything >> well defined and not subject to numerical noise because of the >> regularization induced by pinv. Our covariance for the parameter >> estimates, based on pinv(X), is also not noisy but it's of reduced rank. >> > > What I'm thinking is that changing the algorithm, as here, is like > introducing noise relative to the original, so even if the result is > perfectly valid, the fitted parameters might change significantly. A change > in the 11-th digit wouldn't surprise me at all, depending on the actual X > and rhs. > > >> >> The main problem I have seen so far is that the default rcond is too >> small for some cases, and it breaks down with near singular X with rcond a >> bit larger than 1e-15 in an example with dummy variables. >> >> https://github.com/statsmodels/statsmodels/issues/2628 >> >> http://stackoverflow.com/questions/32599159/robustness-issue-of-statsmodel-linear-regression-ols-python >> >> >> If I think a bit longer, then I remember two other cases where pinv with >> ill conditioned X also doesn't work very well with continuous almost >> collinear data, one NIST example with badly scaled polynomials where >> precision goes down to a few decimals and a very small economic dataset >> where the normal equations (or orthogonality conditions?) are not satisfied >> at a good precision.. >> >> > Yeah, it can make a real mess with polynomial fits ;) > The classic rant on these matters is in Numerical Methods that Work. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Thu Mar 17 03:19:37 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 17 Mar 2016 00:19:37 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: Hi, On Wed, Mar 16, 2016 at 8:39 PM, wrote: > > > On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris > wrote: >> >> >> >> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >> wrote: >>> >>> Hi, >>> >>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >>> wrote: >>> > >>> > >>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >>> > wrote: >>> >> >>> >> >>> >> >>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >>> >> >>> >> wrote: >>> >>> >>> >>> Hi, >>> >>> >>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >>> >>> >>> >>> wrote: >>> >>> > >>> >>> > >>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>> >>> > >>> >>> > wrote: >>> >>> >> >>> >>> >> Hi, >>> >>> >> >>> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy >>> >>> >> 0.17.0: >>> >>> >> >>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>> >>> >> >>> >>> >> This turns out to be due to small changes in the result of >>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same >>> >>> >> platform >>> >>> >> - the travis Ubuntu in this case. >>> >>> >> >>> >>> >> Is that expected? I haven't investigated deeply, but I believe >>> >>> >> the >>> >>> >> 0.16.1 result (back into the depths of time) is slightly more >>> >>> >> accurate. Is that possible? >>> >>> > >>> >>> > >>> >>> > That's very well possible, because we changed the underlying LAPACK >>> >>> > routine >>> >>> > that's used: >>> >>> > >>> >>> > >>> >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. >>> >>> > This PR also had some lstsq precision issues, so could be useful: >>> >>> > https://github.com/scipy/scipy/pull/5550 >>> >>> > >>> >>> > What is the accuracy difference in your case? >>> >>> >>> >>> Damn - I thought you might ask me that! >>> >>> >>> >>> The test where it showed up comes from some calculations on a simple >>> >>> regression problem. >>> >>> >>> >>> The regression model is $y = X B + e$, with ordinary least squares >>> >>> fit >>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >>> >>> >>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >>> >>> pinv, compared to scipy 0.17 pinv: >>> >>> >>> >>> # SSE using high-precision sympy pinv >>> >>> 72.0232781366615313 >>> >>> # SSE using scipy 0.16 pinv >>> >>> 72.0232781366625261 >>> >>> # SSE using scipy 0.17 pinv >>> >>> 72.0232781390480170 >>> >> >>> >> >>> >> That difference looks pretty big, looks almost like a cutoff effect. >>> >> What >>> >> is the matrix X, in particular, what is its condition number or >>> >> singular >>> >> values? Might also be interesting to see residuals for the different >>> >> functions. >>> >> >>> >> >>> >> >>> > >>> > Note that this might be an indication that the result of the >>> > calculation has >>> > an intrinsically large error bar. >>> >>> In [6]: np.linalg.svd(X, compute_uv=False) >>> Out[6]: >>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >>> 1.72675581e-03, 9.51576795e-04]) >>> >>> In [7]: np.linalg.cond(X) >>> Out[7]: 73278544935.300507 >> >> >> Condition number is about 7e10, so you might expect to lose about 10 >> digits of of accuracy just due to roundoff. 
The covariance of the fitting >> error will be `inv(X.T, X) * error`, so you can look at the diagonals for >> the variance of the fitted parameters just for kicks, taking `error` to be >> roundoff error or measurement error. This is not to say that the results >> should not be more reproducible, but should be of interest for the validity >> of the solution. Note that I suspect that there may be a difference in the >> default `rcond` used, or maybe the matrix is on the edge and a small change >> in results changes the effective rank. You can run with the >> `return_rank=True` argument to check that. Looks like machine precision is >> used for the default value of rcond, so experimenting with that probable >> won't be informative, but you might want to try it anyway. > > > > statsmodels is using numpy pinv for the linear model and it works in almost > all cases very well *). Yes, the error for numpy.linalg pinv is a lot less: Sympy high-precision pinv (as before): 72.0232781366602524 Scipy 0.16 pinv (as before, difference at 12th dp): 72.0232781366620856 Scipy 0.17 pinv (as before, difference at 9th dp): 72.0232781390479602 Numpy pinv (difference at 12th dp) In [2]: print_scalar(sse(X, np.linalg.pinv(X), y)) 72.0232781366618440 So numpy pinv is more accurate than scipy pinv for this matrix, for scipy 0.17, and both scipy 0.16 and numpy differ at the 12th dp, as does estimation for a singular matrix (previous email). Incidentally, this is all on scipy / openblas / linux. On OSX: Scipy 0.17.0: 72.0232781366634640 Numpy: 72.0232781366609771 Scipy 0.16.0 72.0232781366634640 So - on OSX - differences also predictably at 12th dp for both scipy versions and numpy. It does seem that scipy 0.17 on Linux is the outlier. Matthew From olivier.grisel at ensta.org Thu Mar 17 05:30:24 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Thu, 17 Mar 2016 10:30:24 +0100 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: Can you check with Atlas on linux then? -- Olivier From matthew.brett at gmail.com Thu Mar 17 15:30:28 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 17 Mar 2016 12:30:28 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Thu, Mar 17, 2016 at 2:30 AM, Olivier Grisel wrote: > Can you check with Atlas on linux then? Yes, same idea, this on another computer, Debian packaged ATLAS: Sympy high-precision pinv SSE 72.0232781366615313 Local numpy 1.10.4 linalg pinv 72.0232781366614176 Local scipy 0.16.0 linalg pinv 72.0232781366600392 Local scipy 0.17.0 linalg pinv 72.0232781368149233 Difference at 10th dp, but still two orders of magnitude different from the other variations. https://github.com/matthew-brett/scipy-pinv/blob/master/pinv_futz.py Matthew From matthew.brett at gmail.com Fri Mar 18 15:07:39 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 18 Mar 2016 12:07:39 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris wrote: > > > On Wed, Mar 16, 2016 at 9:39 PM, wrote: >> >> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris >> wrote: >>> >>> >>> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >>> wrote: >>>> >>>> Hi, >>>> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >>>> wrote: >>>> > >>>> > >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >>>> > wrote: >>>> >> >>>> >> >>>> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >>>> >> >>>> >> wrote: >>>> >>> >>>> >>> Hi, >>>> >>> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >>>> >>> >>>> >>> wrote: >>>> >>> > >>>> >>> > >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>>> >>> > >>>> >>> > wrote: >>>> >>> >> >>>> >>> >> Hi, >>>> >>> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails for scipy >>>> >>> >> 0.17.0: >>>> >>> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>>> >>> >> >>>> >>> >> This turns out to be due to small changes in the result of >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same >>>> >>> >> platform >>>> >>> >> - the travis Ubuntu in this case. >>>> >>> >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I believe >>>> >>> >> the >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly more >>>> >>> >> accurate. Is that possible? >>>> >>> > >>>> >>> > >>>> >>> > That's very well possible, because we changed the underlying >>>> >>> > LAPACK >>>> >>> > routine >>>> >>> > that's used: >>>> >>> > >>>> >>> > >>>> >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. >>>> >>> > This PR also had some lstsq precision issues, so could be useful: >>>> >>> > https://github.com/scipy/scipy/pull/5550 >>>> >>> > >>>> >>> > What is the accuracy difference in your case? >>>> >>> >>>> >>> Damn - I thought you might ask me that! >>>> >>> >>>> >>> The test where it showed up comes from some calculations on a simple >>>> >>> regression problem. >>>> >>> >>>> >>> The regression model is $y = X B + e$, with ordinary least squares >>>> >>> fit >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. >>>> >>> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy 0.16 >>>> >>> pinv, compared to scipy 0.17 pinv: >>>> >>> >>>> >>> # SSE using high-precision sympy pinv >>>> >>> 72.0232781366615313 >>>> >>> # SSE using scipy 0.16 pinv >>>> >>> 72.0232781366625261 >>>> >>> # SSE using scipy 0.17 pinv >>>> >>> 72.0232781390480170 >>>> >> >>>> >> >>>> >> That difference looks pretty big, looks almost like a cutoff effect. >>>> >> What >>>> >> is the matrix X, in particular, what is its condition number or >>>> >> singular >>>> >> values? Might also be interesting to see residuals for the different >>>> >> functions. >>>> >> >>>> >> >>>> >> >>>> > >>>> > Note that this might be an indication that the result of the >>>> > calculation has >>>> > an intrinsically large error bar. 
>>>> >>>> In [6]: np.linalg.svd(X, compute_uv=False) >>>> Out[6]: >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >>>> 1.72675581e-03, 9.51576795e-04]) >>>> >>>> In [7]: np.linalg.cond(X) >>>> Out[7]: 73278544935.300507 >>> >>> >>> Condition number is about 7e10, so you might expect to lose about 10 >>> digits of of accuracy just due to roundoff. The covariance of the fitting >>> error will be `inv(X.T, X) * error`, so you can look at the diagonals for >>> the variance of the fitted parameters just for kicks, taking `error` to be >>> roundoff error or measurement error. This is not to say that the results >>> should not be more reproducible, but should be of interest for the validity >>> of the solution. Note that I suspect that there may be a difference in the >>> default `rcond` used, or maybe the matrix is on the edge and a small change >>> in results changes the effective rank. You can run with the >>> `return_rank=True` argument to check that. Looks like machine precision is >>> used for the default value of rcond, so experimenting with that probable >>> won't be informative, but you might want to try it anyway. >> >> >> >> statsmodels is using numpy pinv for the linear model and it works in >> almost all cases very well *). >> >> Using pinv with singular (given the rcond threshold) X makes everything >> well defined and not subject to numerical noise because of the >> regularization induced by pinv. Our covariance for the parameter estimates, >> based on pinv(X), is also not noisy but it's of reduced rank. > > > What I'm thinking is that changing the algorithm, as here, is like > introducing noise relative to the original, so even if the result is > perfectly valid, the fitted parameters might change significantly. A change > in the 11-th digit wouldn't surprise me at all, depending on the actual X > and rhs. I believe the errors $e = y - X^+ y$ are in general robust to differences between valid pinv solutions, even in extreme cases: In [2]: print_scalar(sse(X, npl.pinv(X), y)) 72.0232781366616024 In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) 72.0232781366939463 In [4]: npl.cond(np.c_[X, X]) Out[4]: 6.6247965335098062e+26 Cheers, Matthew From josef.pktd at gmail.com Fri Mar 18 15:19:38 2016 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 18 Mar 2016 15:19:38 -0400 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett wrote: > On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris > wrote: > > > > > > On Wed, Mar 16, 2016 at 9:39 PM, wrote: > >> > >> > >> > >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris > >> wrote: > >>> > >>> > >>> > >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett < > matthew.brett at gmail.com> > >>> wrote: > >>>> > >>>> Hi, > >>>> > >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris > >>>> wrote: > >>>> > > >>>> > > >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris > >>>> > wrote: > >>>> >> > >>>> >> > >>>> >> > >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett > >>>> >> > >>>> >> wrote: > >>>> >>> > >>>> >>> Hi, > >>>> >>> > >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers > >>>> >>> > >>>> >>> wrote: > >>>> >>> > > >>>> >>> > > >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > >>>> >>> > > >>>> >>> > wrote: > >>>> >>> >> > >>>> >>> >> Hi, > >>>> >>> >> > >>>> >>> >> I just noticed that our (neuroimaging) test suite fails for > scipy > >>>> >>> >> 0.17.0: > >>>> >>> >> > >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 > >>>> >>> >> > >>>> >>> >> This turns out to be due to small changes in the result of > >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same > >>>> >>> >> platform > >>>> >>> >> - the travis Ubuntu in this case. > >>>> >>> >> > >>>> >>> >> Is that expected? I haven't investigated deeply, but I > believe > >>>> >>> >> the > >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly more > >>>> >>> >> accurate. Is that possible? > >>>> >>> > > >>>> >>> > > >>>> >>> > That's very well possible, because we changed the underlying > >>>> >>> > LAPACK > >>>> >>> > routine > >>>> >>> > that's used: > >>>> >>> > > >>>> >>> > > >>>> >>> > > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements > . > >>>> >>> > This PR also had some lstsq precision issues, so could be > useful: > >>>> >>> > https://github.com/scipy/scipy/pull/5550 > >>>> >>> > > >>>> >>> > What is the accuracy difference in your case? > >>>> >>> > >>>> >>> Damn - I thought you might ask me that! > >>>> >>> > >>>> >>> The test where it showed up comes from some calculations on a > simple > >>>> >>> regression problem. > >>>> >>> > >>>> >>> The regression model is $y = X B + e$, with ordinary least squares > >>>> >>> fit > >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix X. > >>>> >>> > >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy > 0.16 > >>>> >>> pinv, compared to scipy 0.17 pinv: > >>>> >>> > >>>> >>> # SSE using high-precision sympy pinv > >>>> >>> 72.0232781366615313 > >>>> >>> # SSE using scipy 0.16 pinv > >>>> >>> 72.0232781366625261 > >>>> >>> # SSE using scipy 0.17 pinv > >>>> >>> 72.0232781390480170 > >>>> >> > >>>> >> > >>>> >> That difference looks pretty big, looks almost like a cutoff > effect. > >>>> >> What > >>>> >> is the matrix X, in particular, what is its condition number or > >>>> >> singular > >>>> >> values? Might also be interesting to see residuals for the > different > >>>> >> functions. > >>>> >> > >>>> >> > >>>> >> > >>>> > > >>>> > Note that this might be an indication that the result of the > >>>> > calculation has > >>>> > an intrinsically large error bar. 
> >>>> > >>>> In [6]: np.linalg.svd(X, compute_uv=False) > >>>> Out[6]: > >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, > >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, > >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, > >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, > >>>> 1.72675581e-03, 9.51576795e-04]) > >>>> > >>>> In [7]: np.linalg.cond(X) > >>>> Out[7]: 73278544935.300507 > >>> > >>> > >>> Condition number is about 7e10, so you might expect to lose about 10 > >>> digits of of accuracy just due to roundoff. The covariance of the > fitting > >>> error will be `inv(X.T, X) * error`, so you can look at the diagonals > for > >>> the variance of the fitted parameters just for kicks, taking `error` > to be > >>> roundoff error or measurement error. This is not to say that the > results > >>> should not be more reproducible, but should be of interest for the > validity > >>> of the solution. Note that I suspect that there may be a difference > in the > >>> default `rcond` used, or maybe the matrix is on the edge and a small > change > >>> in results changes the effective rank. You can run with the > >>> `return_rank=True` argument to check that. Looks like machine > precision is > >>> used for the default value of rcond, so experimenting with that > probable > >>> won't be informative, but you might want to try it anyway. > >> > >> > >> > >> statsmodels is using numpy pinv for the linear model and it works in > >> almost all cases very well *). > >> > >> Using pinv with singular (given the rcond threshold) X makes everything > >> well defined and not subject to numerical noise because of the > >> regularization induced by pinv. Our covariance for the parameter > estimates, > >> based on pinv(X), is also not noisy but it's of reduced rank. > > > > > > What I'm thinking is that changing the algorithm, as here, is like > > introducing noise relative to the original, so even if the result is > > perfectly valid, the fitted parameters might change significantly. A > change > > in the 11-th digit wouldn't surprise me at all, depending on the actual X > > and rhs. > > I believe the errors $e = y - X^+ y$ are in general robust to > differences between valid pinv solutions, even in extreme cases: > > In [2]: print_scalar(sse(X, npl.pinv(X), y)) > 72.0232781366616024 > > In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) > 72.0232781366939463 > > In [4]: npl.cond(np.c_[X, X]) > Out[4]: 6.6247965335098062e+26 > I think this does not depend on the smallest eigenvalues which should be regularized away in cases like this. The potential problem is with eigenvalues that are just a but too large for rcond to remove them but still have numerical problems. I think it's a similar issue about the threshold as the one that you raised some time ago with determining the matrix rank for near singular matrices. Josef > > Cheers, > > Matthew > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Fri Mar 18 15:56:04 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 18 Mar 2016 12:56:04 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Fri, Mar 18, 2016 at 12:19 PM, wrote: > > > On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett > wrote: >> >> On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris >> wrote: >> > >> > >> > On Wed, Mar 16, 2016 at 9:39 PM, wrote: >> >> >> >> >> >> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris >> >> wrote: >> >>> >> >>> >> >>> >> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >> >>> >> >>> wrote: >> >>>> >> >>>> Hi, >> >>>> >> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >> >>>> wrote: >> >>>> > >> >>>> > >> >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >> >>>> > wrote: >> >>>> >> >> >>>> >> >> >>>> >> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >> >>>> >> >> >>>> >> wrote: >> >>>> >>> >> >>>> >>> Hi, >> >>>> >>> >> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >> >>>> >>> >> >>>> >>> wrote: >> >>>> >>> > >> >>>> >>> > >> >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >> >>>> >>> > >> >>>> >>> > wrote: >> >>>> >>> >> >> >>>> >>> >> Hi, >> >>>> >>> >> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails for >> >>>> >>> >> scipy >> >>>> >>> >> 0.17.0: >> >>>> >>> >> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >> >>>> >>> >> >> >>>> >>> >> This turns out to be due to small changes in the result of >> >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same >> >>>> >>> >> platform >> >>>> >>> >> - the travis Ubuntu in this case. >> >>>> >>> >> >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I >> >>>> >>> >> believe >> >>>> >>> >> the >> >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly more >> >>>> >>> >> accurate. Is that possible? >> >>>> >>> > >> >>>> >>> > >> >>>> >>> > That's very well possible, because we changed the underlying >> >>>> >>> > LAPACK >> >>>> >>> > routine >> >>>> >>> > that's used: >> >>>> >>> > >> >>>> >>> > >> >>>> >>> > >> >>>> >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. >> >>>> >>> > This PR also had some lstsq precision issues, so could be >> >>>> >>> > useful: >> >>>> >>> > https://github.com/scipy/scipy/pull/5550 >> >>>> >>> > >> >>>> >>> > What is the accuracy difference in your case? >> >>>> >>> >> >>>> >>> Damn - I thought you might ask me that! >> >>>> >>> >> >>>> >>> The test where it showed up comes from some calculations on a >> >>>> >>> simple >> >>>> >>> regression problem. >> >>>> >>> >> >>>> >>> The regression model is $y = X B + e$, with ordinary least >> >>>> >>> squares >> >>>> >>> fit >> >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix >> >>>> >>> X. >> >>>> >>> >> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy >> >>>> >>> 0.16 >> >>>> >>> pinv, compared to scipy 0.17 pinv: >> >>>> >>> >> >>>> >>> # SSE using high-precision sympy pinv >> >>>> >>> 72.0232781366615313 >> >>>> >>> # SSE using scipy 0.16 pinv >> >>>> >>> 72.0232781366625261 >> >>>> >>> # SSE using scipy 0.17 pinv >> >>>> >>> 72.0232781390480170 >> >>>> >> >> >>>> >> >> >>>> >> That difference looks pretty big, looks almost like a cutoff >> >>>> >> effect. >> >>>> >> What >> >>>> >> is the matrix X, in particular, what is its condition number or >> >>>> >> singular >> >>>> >> values? Might also be interesting to see residuals for the >> >>>> >> different >> >>>> >> functions. 
>> >>>> >> >> >>>> >> >> >>>> >> >> >>>> > >> >>>> > Note that this might be an indication that the result of the >> >>>> > calculation has >> >>>> > an intrinsically large error bar. >> >>>> >> >>>> In [6]: np.linalg.svd(X, compute_uv=False) >> >>>> Out[6]: >> >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >> >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >> >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >> >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >> >>>> 1.72675581e-03, 9.51576795e-04]) >> >>>> >> >>>> In [7]: np.linalg.cond(X) >> >>>> Out[7]: 73278544935.300507 >> >>> >> >>> >> >>> Condition number is about 7e10, so you might expect to lose about 10 >> >>> digits of of accuracy just due to roundoff. The covariance of the >> >>> fitting >> >>> error will be `inv(X.T, X) * error`, so you can look at the diagonals >> >>> for >> >>> the variance of the fitted parameters just for kicks, taking `error` >> >>> to be >> >>> roundoff error or measurement error. This is not to say that the >> >>> results >> >>> should not be more reproducible, but should be of interest for the >> >>> validity >> >>> of the solution. Note that I suspect that there may be a difference >> >>> in the >> >>> default `rcond` used, or maybe the matrix is on the edge and a small >> >>> change >> >>> in results changes the effective rank. You can run with the >> >>> `return_rank=True` argument to check that. Looks like machine >> >>> precision is >> >>> used for the default value of rcond, so experimenting with that >> >>> probable >> >>> won't be informative, but you might want to try it anyway. >> >> >> >> >> >> >> >> statsmodels is using numpy pinv for the linear model and it works in >> >> almost all cases very well *). >> >> >> >> Using pinv with singular (given the rcond threshold) X makes everything >> >> well defined and not subject to numerical noise because of the >> >> regularization induced by pinv. Our covariance for the parameter >> >> estimates, >> >> based on pinv(X), is also not noisy but it's of reduced rank. >> > >> > >> > What I'm thinking is that changing the algorithm, as here, is like >> > introducing noise relative to the original, so even if the result is >> > perfectly valid, the fitted parameters might change significantly. A >> > change >> > in the 11-th digit wouldn't surprise me at all, depending on the actual >> > X >> > and rhs. >> >> I believe the errors $e = y - X^+ y$ are in general robust to >> differences between valid pinv solutions, even in extreme cases: >> >> In [2]: print_scalar(sse(X, npl.pinv(X), y)) >> 72.0232781366616024 >> >> In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) >> 72.0232781366939463 >> >> In [4]: npl.cond(np.c_[X, X]) >> Out[4]: 6.6247965335098062e+26 > > > > I think this does not depend on the smallest eigenvalues which should be > regularized away in cases like this. > The potential problem is with eigenvalues that are just a but too large for > rcond to remove them but still have numerical problems. > > I think it's a similar issue about the threshold as the one that you raised > some time ago with determining the matrix rank for near singular matrices. My point was (I think you're agreeing) that the accuracy of the pinv does not depend in a simple way on the condition number of the input matrix. 
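To make that concrete with something self-contained - this is only a
sketch, the helper names and the synthetic X, y below are just for
illustration (the real X comes from the nipy test suite):

import numpy as np
import numpy.linalg as npl

def sse(design, piv, data):
    # residual sum of squares for the fit data ~ dot(design, B), where
    # B is computed from a given pseudoinverse of the design matrix
    fitted = design.dot(piv.dot(data))
    return np.sum((data - fitted) ** 2)

def print_scalar(val):
    print('{0:.16f}'.format(val))

rng = np.random.RandomState(0)
X = rng.normal(size=(20, 5))
X[:, -1] = X[:, 0] + 1e-8 * rng.normal(size=20)  # nearly dependent column
y = rng.normal(size=20)

print_scalar(sse(X, npl.pinv(X), y))
print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y))

Even though np.c_[X, X] is exactly rank deficient, the two SSE values
should agree to many digits, because the residuals only depend on the
projection onto the column space, not on which valid pseudoinverse was
used to compute the fit.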
I'm not quite sure how to assess the change in accuracy of scipy 0.17 on Linux, but it does seem to me to be significantly different (and worse than) the other options (like numpy, scipy 0.16) Matthew From josef.pktd at gmail.com Fri Mar 18 16:09:30 2016 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 18 Mar 2016 16:09:30 -0400 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: On Fri, Mar 18, 2016 at 3:56 PM, Matthew Brett wrote: > On Fri, Mar 18, 2016 at 12:19 PM, wrote: > > > > > > On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett > > wrote: > >> > >> On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris > >> wrote: > >> > > >> > > >> > On Wed, Mar 16, 2016 at 9:39 PM, wrote: > >> >> > >> >> > >> >> > >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris > >> >> wrote: > >> >>> > >> >>> > >> >>> > >> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett > >> >>> > >> >>> wrote: > >> >>>> > >> >>>> Hi, > >> >>>> > >> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris > >> >>>> wrote: > >> >>>> > > >> >>>> > > >> >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris > >> >>>> > wrote: > >> >>>> >> > >> >>>> >> > >> >>>> >> > >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett > >> >>>> >> > >> >>>> >> wrote: > >> >>>> >>> > >> >>>> >>> Hi, > >> >>>> >>> > >> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers > >> >>>> >>> > >> >>>> >>> wrote: > >> >>>> >>> > > >> >>>> >>> > > >> >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > >> >>>> >>> > > >> >>>> >>> > wrote: > >> >>>> >>> >> > >> >>>> >>> >> Hi, > >> >>>> >>> >> > >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails for > >> >>>> >>> >> scipy > >> >>>> >>> >> 0.17.0: > >> >>>> >>> >> > >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 > >> >>>> >>> >> > >> >>>> >>> >> This turns out to be due to small changes in the result of > >> >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the same > >> >>>> >>> >> platform > >> >>>> >>> >> - the travis Ubuntu in this case. > >> >>>> >>> >> > >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I > >> >>>> >>> >> believe > >> >>>> >>> >> the > >> >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly > more > >> >>>> >>> >> accurate. Is that possible? > >> >>>> >>> > > >> >>>> >>> > > >> >>>> >>> > That's very well possible, because we changed the underlying > >> >>>> >>> > LAPACK > >> >>>> >>> > routine > >> >>>> >>> > that's used: > >> >>>> >>> > > >> >>>> >>> > > >> >>>> >>> > > >> >>>> >>> > > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements > . > >> >>>> >>> > This PR also had some lstsq precision issues, so could be > >> >>>> >>> > useful: > >> >>>> >>> > https://github.com/scipy/scipy/pull/5550 > >> >>>> >>> > > >> >>>> >>> > What is the accuracy difference in your case? > >> >>>> >>> > >> >>>> >>> Damn - I thought you might ask me that! > >> >>>> >>> > >> >>>> >>> The test where it showed up comes from some calculations on a > >> >>>> >>> simple > >> >>>> >>> regression problem. > >> >>>> >>> > >> >>>> >>> The regression model is $y = X B + e$, with ordinary least > >> >>>> >>> squares > >> >>>> >>> fit > >> >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of matrix > >> >>>> >>> X. 
> >> >>>> >>> > >> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy > >> >>>> >>> 0.16 > >> >>>> >>> pinv, compared to scipy 0.17 pinv: > >> >>>> >>> > >> >>>> >>> # SSE using high-precision sympy pinv > >> >>>> >>> 72.0232781366615313 > >> >>>> >>> # SSE using scipy 0.16 pinv > >> >>>> >>> 72.0232781366625261 > >> >>>> >>> # SSE using scipy 0.17 pinv > >> >>>> >>> 72.0232781390480170 > >> >>>> >> > >> >>>> >> > >> >>>> >> That difference looks pretty big, looks almost like a cutoff > >> >>>> >> effect. > >> >>>> >> What > >> >>>> >> is the matrix X, in particular, what is its condition number or > >> >>>> >> singular > >> >>>> >> values? Might also be interesting to see residuals for the > >> >>>> >> different > >> >>>> >> functions. > >> >>>> >> > >> >>>> >> > >> >>>> >> > >> >>>> > > >> >>>> > Note that this might be an indication that the result of the > >> >>>> > calculation has > >> >>>> > an intrinsically large error bar. > >> >>>> > >> >>>> In [6]: np.linalg.svd(X, compute_uv=False) > >> >>>> Out[6]: > >> >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, > >> >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, > >> >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, > >> >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, > >> >>>> 1.72675581e-03, 9.51576795e-04]) > >> >>>> > >> >>>> In [7]: np.linalg.cond(X) > >> >>>> Out[7]: 73278544935.300507 > >> >>> > >> >>> > >> >>> Condition number is about 7e10, so you might expect to lose about 10 > >> >>> digits of of accuracy just due to roundoff. The covariance of the > >> >>> fitting > >> >>> error will be `inv(X.T, X) * error`, so you can look at the > diagonals > >> >>> for > >> >>> the variance of the fitted parameters just for kicks, taking `error` > >> >>> to be > >> >>> roundoff error or measurement error. This is not to say that the > >> >>> results > >> >>> should not be more reproducible, but should be of interest for the > >> >>> validity > >> >>> of the solution. Note that I suspect that there may be a difference > >> >>> in the > >> >>> default `rcond` used, or maybe the matrix is on the edge and a small > >> >>> change > >> >>> in results changes the effective rank. You can run with the > >> >>> `return_rank=True` argument to check that. Looks like machine > >> >>> precision is > >> >>> used for the default value of rcond, so experimenting with that > >> >>> probable > >> >>> won't be informative, but you might want to try it anyway. > >> >> > >> >> > >> >> > >> >> statsmodels is using numpy pinv for the linear model and it works in > >> >> almost all cases very well *). > >> >> > >> >> Using pinv with singular (given the rcond threshold) X makes > everything > >> >> well defined and not subject to numerical noise because of the > >> >> regularization induced by pinv. Our covariance for the parameter > >> >> estimates, > >> >> based on pinv(X), is also not noisy but it's of reduced rank. > >> > > >> > > >> > What I'm thinking is that changing the algorithm, as here, is like > >> > introducing noise relative to the original, so even if the result is > >> > perfectly valid, the fitted parameters might change significantly. A > >> > change > >> > in the 11-th digit wouldn't surprise me at all, depending on the > actual > >> > X > >> > and rhs. 
> >> > >> I believe the errors $e = y - X^+ y$ are in general robust to > >> differences between valid pinv solutions, even in extreme cases: > >> > >> In [2]: print_scalar(sse(X, npl.pinv(X), y)) > >> 72.0232781366616024 > >> > >> In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) > >> 72.0232781366939463 > >> > >> In [4]: npl.cond(np.c_[X, X]) > >> Out[4]: 6.6247965335098062e+26 > > > > > > > > I think this does not depend on the smallest eigenvalues which should be > > regularized away in cases like this. > > The potential problem is with eigenvalues that are just a but too large > for > > rcond to remove them but still have numerical problems. > > > > I think it's a similar issue about the threshold as the one that you > raised > > some time ago with determining the matrix rank for near singular > matrices. > > My point was (I think you're agreeing) that the accuracy of the pinv > does not depend in a simple way on the condition number of the input > matrix. > That's not the point I wanted to make. The condition number doesn't matter if there is an singular/eigenvalue with rcond 1e-16 or 1e-30 or 1e-100. However, the numerical accuracy might depend on the condition number when there is a singular value with rcond of 2e-15 or 5e-15. I didn't manage to find the default cond or rcond for scipy.linalg.pinv and lstsq. The problem I had was with numpy's pinv. I think there is a grey zone in rcond of eigenvalues that might be bad numerically. (My only experience is with SVD based pinv.) Josef > > I'm not quite sure how to assess the change in accuracy of scipy 0.17 > on Linux, but it does seem to me to be significantly different (and > worse than) the other options (like numpy, scipy 0.16) > > Matthew > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Fri Mar 18 16:19:58 2016 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 18 Mar 2016 16:19:58 -0400 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Fri, Mar 18, 2016 at 4:09 PM, wrote: > > > On Fri, Mar 18, 2016 at 3:56 PM, Matthew Brett > wrote: > >> On Fri, Mar 18, 2016 at 12:19 PM, wrote: >> > >> > >> > On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett > > >> > wrote: >> >> >> >> On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris >> >> wrote: >> >> > >> >> > >> >> > On Wed, Mar 16, 2016 at 9:39 PM, wrote: >> >> >> >> >> >> >> >> >> >> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris >> >> >> wrote: >> >> >>> >> >> >>> >> >> >>> >> >> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >> >> >>> >> >> >>> wrote: >> >> >>>> >> >> >>>> Hi, >> >> >>>> >> >> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >> >> >>>> wrote: >> >> >>>> > >> >> >>>> > >> >> >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >> >> >>>> > wrote: >> >> >>>> >> >> >> >>>> >> >> >> >>>> >> >> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >> >> >>>> >> >> >> >>>> >> wrote: >> >> >>>> >>> >> >> >>>> >>> Hi, >> >> >>>> >>> >> >> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >> >> >>>> >>> >> >> >>>> >>> wrote: >> >> >>>> >>> > >> >> >>>> >>> > >> >> >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >> >> >>>> >>> > >> >> >>>> >>> > wrote: >> >> >>>> >>> >> >> >> >>>> >>> >> Hi, >> >> >>>> >>> >> >> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails for >> >> >>>> >>> >> scipy >> >> >>>> >>> >> 0.17.0: >> >> >>>> >>> >> >> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >> >> >>>> >>> >> >> >> >>>> >>> >> This turns out to be due to small changes in the result of >> >> >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the >> same >> >> >>>> >>> >> platform >> >> >>>> >>> >> - the travis Ubuntu in this case. >> >> >>>> >>> >> >> >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I >> >> >>>> >>> >> believe >> >> >>>> >>> >> the >> >> >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly >> more >> >> >>>> >>> >> accurate. Is that possible? >> >> >>>> >>> > >> >> >>>> >>> > >> >> >>>> >>> > That's very well possible, because we changed the underlying >> >> >>>> >>> > LAPACK >> >> >>>> >>> > routine >> >> >>>> >>> > that's used: >> >> >>>> >>> > >> >> >>>> >>> > >> >> >>>> >>> > >> >> >>>> >>> > >> https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements >> . >> >> >>>> >>> > This PR also had some lstsq precision issues, so could be >> >> >>>> >>> > useful: >> >> >>>> >>> > https://github.com/scipy/scipy/pull/5550 >> >> >>>> >>> > >> >> >>>> >>> > What is the accuracy difference in your case? >> >> >>>> >>> >> >> >>>> >>> Damn - I thought you might ask me that! >> >> >>>> >>> >> >> >>>> >>> The test where it showed up comes from some calculations on a >> >> >>>> >>> simple >> >> >>>> >>> regression problem. >> >> >>>> >>> >> >> >>>> >>> The regression model is $y = X B + e$, with ordinary least >> >> >>>> >>> squares >> >> >>>> >>> fit >> >> >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of >> matrix >> >> >>>> >>> X. 
>> >> >>>> >>> >> >> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using scipy >> >> >>>> >>> 0.16 >> >> >>>> >>> pinv, compared to scipy 0.17 pinv: >> >> >>>> >>> >> >> >>>> >>> # SSE using high-precision sympy pinv >> >> >>>> >>> 72.0232781366615313 >> >> >>>> >>> # SSE using scipy 0.16 pinv >> >> >>>> >>> 72.0232781366625261 >> >> >>>> >>> # SSE using scipy 0.17 pinv >> >> >>>> >>> 72.0232781390480170 >> >> >>>> >> >> >> >>>> >> >> >> >>>> >> That difference looks pretty big, looks almost like a cutoff >> >> >>>> >> effect. >> >> >>>> >> What >> >> >>>> >> is the matrix X, in particular, what is its condition number or >> >> >>>> >> singular >> >> >>>> >> values? Might also be interesting to see residuals for the >> >> >>>> >> different >> >> >>>> >> functions. >> >> >>>> >> >> >> >>>> >> >> >> >>>> >> >> >> >>>> > >> >> >>>> > Note that this might be an indication that the result of the >> >> >>>> > calculation has >> >> >>>> > an intrinsically large error bar. >> >> >>>> >> >> >>>> In [6]: np.linalg.svd(X, compute_uv=False) >> >> >>>> Out[6]: >> >> >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >> >> >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >> >> >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >> >> >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >> >> >>>> 1.72675581e-03, 9.51576795e-04]) >> >> >>>> >> >> >>>> In [7]: np.linalg.cond(X) >> >> >>>> Out[7]: 73278544935.300507 >> >> >>> >> >> >>> >> >> >>> Condition number is about 7e10, so you might expect to lose about >> 10 >> >> >>> digits of of accuracy just due to roundoff. The covariance of the >> >> >>> fitting >> >> >>> error will be `inv(X.T, X) * error`, so you can look at the >> diagonals >> >> >>> for >> >> >>> the variance of the fitted parameters just for kicks, taking >> `error` >> >> >>> to be >> >> >>> roundoff error or measurement error. This is not to say that the >> >> >>> results >> >> >>> should not be more reproducible, but should be of interest for the >> >> >>> validity >> >> >>> of the solution. Note that I suspect that there may be a >> difference >> >> >>> in the >> >> >>> default `rcond` used, or maybe the matrix is on the edge and a >> small >> >> >>> change >> >> >>> in results changes the effective rank. You can run with the >> >> >>> `return_rank=True` argument to check that. Looks like machine >> >> >>> precision is >> >> >>> used for the default value of rcond, so experimenting with that >> >> >>> probable >> >> >>> won't be informative, but you might want to try it anyway. >> >> >> >> >> >> >> >> >> >> >> >> statsmodels is using numpy pinv for the linear model and it works in >> >> >> almost all cases very well *). >> >> >> >> >> >> Using pinv with singular (given the rcond threshold) X makes >> everything >> >> >> well defined and not subject to numerical noise because of the >> >> >> regularization induced by pinv. Our covariance for the parameter >> >> >> estimates, >> >> >> based on pinv(X), is also not noisy but it's of reduced rank. >> >> > >> >> > >> >> > What I'm thinking is that changing the algorithm, as here, is like >> >> > introducing noise relative to the original, so even if the result is >> >> > perfectly valid, the fitted parameters might change significantly. A >> >> > change >> >> > in the 11-th digit wouldn't surprise me at all, depending on the >> actual >> >> > X >> >> > and rhs. 
>> >> >> >> I believe the errors $e = y - X^+ y$ are in general robust to >> >> differences between valid pinv solutions, even in extreme cases: >> >> >> >> In [2]: print_scalar(sse(X, npl.pinv(X), y)) >> >> 72.0232781366616024 >> >> >> >> In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) >> >> 72.0232781366939463 >> >> >> >> In [4]: npl.cond(np.c_[X, X]) >> >> Out[4]: 6.6247965335098062e+26 >> > >> > >> > >> > I think this does not depend on the smallest eigenvalues which should be >> > regularized away in cases like this. >> > The potential problem is with eigenvalues that are just a but too large >> for >> > rcond to remove them but still have numerical problems. >> > >> > I think it's a similar issue about the threshold as the one that you >> raised >> > some time ago with determining the matrix rank for near singular >> matrices. >> >> My point was (I think you're agreeing) that the accuracy of the pinv >> does not depend in a simple way on the condition number of the input >> matrix. >> > > That's not the point I wanted to make. > The condition number doesn't matter if there is an singular/eigenvalue > with rcond 1e-16 or 1e-30 or 1e-100. > However, the numerical accuracy might depend on the condition number when > there is a singular value with rcond of 2e-15 or 5e-15. > I didn't manage to find the default cond or rcond for scipy.linalg.pinv > and lstsq. The problem I had was with numpy's pinv. > I think there is a grey zone in rcond of eigenvalues that might be bad > numerically. > > (My only experience is with SVD based pinv.) > In the singular values array that you showed earlier, you have three small eigenvalues, but their rcond are only around 1e-11. So from my experience with SVD pinv, there shouldn't be any problem with a pinv in this case. Josef > > Josef > > >> >> I'm not quite sure how to assess the change in accuracy of scipy 0.17 >> on Linux, but it does seem to me to be significantly different (and >> worse than) the other options (like numpy, scipy 0.16) >> >> Matthew >> _______________________________________________ >> SciPy-Dev mailing list >> SciPy-Dev at scipy.org >> https://mail.scipy.org/mailman/listinfo/scipy-dev >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Sat Mar 19 15:31:32 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 19 Mar 2016 12:31:32 -0700 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Fri, Mar 18, 2016 at 1:19 PM, wrote: > > > On Fri, Mar 18, 2016 at 4:09 PM, wrote: >> >> >> >> On Fri, Mar 18, 2016 at 3:56 PM, Matthew Brett >> wrote: >>> >>> On Fri, Mar 18, 2016 at 12:19 PM, wrote: >>> > >>> > >>> > On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett >>> > >>> > wrote: >>> >> >>> >> On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris >>> >> wrote: >>> >> > >>> >> > >>> >> > On Wed, Mar 16, 2016 at 9:39 PM, wrote: >>> >> >> >>> >> >> >>> >> >> >>> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris >>> >> >> wrote: >>> >> >>> >>> >> >>> >>> >> >>> >>> >> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett >>> >> >>> >>> >> >>> wrote: >>> >> >>>> >>> >> >>>> Hi, >>> >> >>>> >>> >> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris >>> >> >>>> wrote: >>> >> >>>> > >>> >> >>>> > >>> >> >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris >>> >> >>>> > wrote: >>> >> >>>> >> >>> >> >>>> >> >>> >> >>>> >> >>> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett >>> >> >>>> >> >>> >> >>>> >> wrote: >>> >> >>>> >>> >>> >> >>>> >>> Hi, >>> >> >>>> >>> >>> >> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers >>> >> >>>> >>> >>> >> >>>> >>> wrote: >>> >> >>>> >>> > >>> >> >>>> >>> > >>> >> >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett >>> >> >>>> >>> > >>> >> >>>> >>> > wrote: >>> >> >>>> >>> >> >>> >> >>>> >>> >> Hi, >>> >> >>>> >>> >> >>> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails >>> >> >>>> >>> >> for >>> >> >>>> >>> >> scipy >>> >> >>>> >>> >> 0.17.0: >>> >> >>>> >>> >> >>> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 >>> >> >>>> >>> >> >>> >> >>>> >>> >> This turns out to be due to small changes in the result of >>> >> >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the >>> >> >>>> >>> >> same >>> >> >>>> >>> >> platform >>> >> >>>> >>> >> - the travis Ubuntu in this case. >>> >> >>>> >>> >> >>> >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I >>> >> >>>> >>> >> believe >>> >> >>>> >>> >> the >>> >> >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly >>> >> >>>> >>> >> more >>> >> >>>> >>> >> accurate. Is that possible? >>> >> >>>> >>> > >>> >> >>>> >>> > >>> >> >>>> >>> > That's very well possible, because we changed the >>> >> >>>> >>> > underlying >>> >> >>>> >>> > LAPACK >>> >> >>>> >>> > routine >>> >> >>>> >>> > that's used: >>> >> >>>> >>> > >>> >> >>>> >>> > >>> >> >>>> >>> > >>> >> >>>> >>> > >>> >> >>>> >>> > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements. >>> >> >>>> >>> > This PR also had some lstsq precision issues, so could be >>> >> >>>> >>> > useful: >>> >> >>>> >>> > https://github.com/scipy/scipy/pull/5550 >>> >> >>>> >>> > >>> >> >>>> >>> > What is the accuracy difference in your case? >>> >> >>>> >>> >>> >> >>>> >>> Damn - I thought you might ask me that! >>> >> >>>> >>> >>> >> >>>> >>> The test where it showed up comes from some calculations on a >>> >> >>>> >>> simple >>> >> >>>> >>> regression problem. >>> >> >>>> >>> >>> >> >>>> >>> The regression model is $y = X B + e$, with ordinary least >>> >> >>>> >>> squares >>> >> >>>> >>> fit >>> >> >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of >>> >> >>>> >>> matrix >>> >> >>>> >>> X. 
>>> >> >>>> >>> >>> >> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using >>> >> >>>> >>> scipy >>> >> >>>> >>> 0.16 >>> >> >>>> >>> pinv, compared to scipy 0.17 pinv: >>> >> >>>> >>> >>> >> >>>> >>> # SSE using high-precision sympy pinv >>> >> >>>> >>> 72.0232781366615313 >>> >> >>>> >>> # SSE using scipy 0.16 pinv >>> >> >>>> >>> 72.0232781366625261 >>> >> >>>> >>> # SSE using scipy 0.17 pinv >>> >> >>>> >>> 72.0232781390480170 >>> >> >>>> >> >>> >> >>>> >> >>> >> >>>> >> That difference looks pretty big, looks almost like a cutoff >>> >> >>>> >> effect. >>> >> >>>> >> What >>> >> >>>> >> is the matrix X, in particular, what is its condition number >>> >> >>>> >> or >>> >> >>>> >> singular >>> >> >>>> >> values? Might also be interesting to see residuals for the >>> >> >>>> >> different >>> >> >>>> >> functions. >>> >> >>>> >> >>> >> >>>> >> >>> >> >>>> >> >>> >> >>>> > >>> >> >>>> > Note that this might be an indication that the result of the >>> >> >>>> > calculation has >>> >> >>>> > an intrinsically large error bar. >>> >> >>>> >>> >> >>>> In [6]: np.linalg.svd(X, compute_uv=False) >>> >> >>>> Out[6]: >>> >> >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, >>> >> >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, >>> >> >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, >>> >> >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, >>> >> >>>> 1.72675581e-03, 9.51576795e-04]) >>> >> >>>> >>> >> >>>> In [7]: np.linalg.cond(X) >>> >> >>>> Out[7]: 73278544935.300507 >>> >> >>> >>> >> >>> >>> >> >>> Condition number is about 7e10, so you might expect to lose about >>> >> >>> 10 >>> >> >>> digits of of accuracy just due to roundoff. The covariance of the >>> >> >>> fitting >>> >> >>> error will be `inv(X.T, X) * error`, so you can look at the >>> >> >>> diagonals >>> >> >>> for >>> >> >>> the variance of the fitted parameters just for kicks, taking >>> >> >>> `error` >>> >> >>> to be >>> >> >>> roundoff error or measurement error. This is not to say that the >>> >> >>> results >>> >> >>> should not be more reproducible, but should be of interest for the >>> >> >>> validity >>> >> >>> of the solution. Note that I suspect that there may be a >>> >> >>> difference >>> >> >>> in the >>> >> >>> default `rcond` used, or maybe the matrix is on the edge and a >>> >> >>> small >>> >> >>> change >>> >> >>> in results changes the effective rank. You can run with the >>> >> >>> `return_rank=True` argument to check that. Looks like machine >>> >> >>> precision is >>> >> >>> used for the default value of rcond, so experimenting with that >>> >> >>> probable >>> >> >>> won't be informative, but you might want to try it anyway. >>> >> >> >>> >> >> >>> >> >> >>> >> >> statsmodels is using numpy pinv for the linear model and it works >>> >> >> in >>> >> >> almost all cases very well *). >>> >> >> >>> >> >> Using pinv with singular (given the rcond threshold) X makes >>> >> >> everything >>> >> >> well defined and not subject to numerical noise because of the >>> >> >> regularization induced by pinv. Our covariance for the parameter >>> >> >> estimates, >>> >> >> based on pinv(X), is also not noisy but it's of reduced rank. >>> >> > >>> >> > >>> >> > What I'm thinking is that changing the algorithm, as here, is like >>> >> > introducing noise relative to the original, so even if the result is >>> >> > perfectly valid, the fitted parameters might change significantly. 
A >>> >> > change >>> >> > in the 11-th digit wouldn't surprise me at all, depending on the >>> >> > actual >>> >> > X >>> >> > and rhs. >>> >> >>> >> I believe the errors $e = y - X^+ y$ are in general robust to >>> >> differences between valid pinv solutions, even in extreme cases: >>> >> >>> >> In [2]: print_scalar(sse(X, npl.pinv(X), y)) >>> >> 72.0232781366616024 >>> >> >>> >> In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) >>> >> 72.0232781366939463 >>> >> >>> >> In [4]: npl.cond(np.c_[X, X]) >>> >> Out[4]: 6.6247965335098062e+26 >>> > >>> > >>> > >>> > I think this does not depend on the smallest eigenvalues which should >>> > be >>> > regularized away in cases like this. >>> > The potential problem is with eigenvalues that are just a but too large >>> > for >>> > rcond to remove them but still have numerical problems. >>> > >>> > I think it's a similar issue about the threshold as the one that you >>> > raised >>> > some time ago with determining the matrix rank for near singular >>> > matrices. >>> >>> My point was (I think you're agreeing) that the accuracy of the pinv >>> does not depend in a simple way on the condition number of the input >>> matrix. >> >> >> That's not the point I wanted to make. >> The condition number doesn't matter if there is an singular/eigenvalue >> with rcond 1e-16 or 1e-30 or 1e-100. >> However, the numerical accuracy might depend on the condition number when >> there is a singular value with rcond of 2e-15 or 5e-15. >> I didn't manage to find the default cond or rcond for scipy.linalg.pinv >> and lstsq. The problem I had was with numpy's pinv. >> I think there is a grey zone in rcond of eigenvalues that might be bad >> numerically. >> >> (My only experience is with SVD based pinv.) > > > In the singular values array that you showed earlier, you have three small > eigenvalues, but their rcond are only around 1e-11. > So from my experience with SVD pinv, there shouldn't be any problem with a > pinv in this case. So - is there any downside to recommending numpy.linalg.pinv over scipy.linalg.pinv in general? Cheers, Matthew From josef.pktd at gmail.com Sat Mar 19 15:58:10 2016 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sat, 19 Mar 2016 15:58:10 -0400 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? 
In-Reply-To: References: Message-ID: On Sat, Mar 19, 2016 at 3:31 PM, Matthew Brett wrote: > On Fri, Mar 18, 2016 at 1:19 PM, wrote: > > > > > > On Fri, Mar 18, 2016 at 4:09 PM, wrote: > >> > >> > >> > >> On Fri, Mar 18, 2016 at 3:56 PM, Matthew Brett > > >> wrote: > >>> > >>> On Fri, Mar 18, 2016 at 12:19 PM, wrote: > >>> > > >>> > > >>> > On Fri, Mar 18, 2016 at 3:07 PM, Matthew Brett > >>> > > >>> > wrote: > >>> >> > >>> >> On Wed, Mar 16, 2016 at 8:49 PM, Charles R Harris > >>> >> wrote: > >>> >> > > >>> >> > > >>> >> > On Wed, Mar 16, 2016 at 9:39 PM, wrote: > >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> On Wed, Mar 16, 2016 at 11:01 PM, Charles R Harris > >>> >> >> wrote: > >>> >> >>> > >>> >> >>> > >>> >> >>> > >>> >> >>> On Wed, Mar 16, 2016 at 7:16 PM, Matthew Brett > >>> >> >>> > >>> >> >>> wrote: > >>> >> >>>> > >>> >> >>>> Hi, > >>> >> >>>> > >>> >> >>>> On Wed, Mar 16, 2016 at 5:48 PM, Charles R Harris > >>> >> >>>> wrote: > >>> >> >>>> > > >>> >> >>>> > > >>> >> >>>> > On Wed, Mar 16, 2016 at 6:41 PM, Charles R Harris > >>> >> >>>> > wrote: > >>> >> >>>> >> > >>> >> >>>> >> > >>> >> >>>> >> > >>> >> >>>> >> On Wed, Mar 16, 2016 at 6:05 PM, Matthew Brett > >>> >> >>>> >> > >>> >> >>>> >> wrote: > >>> >> >>>> >>> > >>> >> >>>> >>> Hi, > >>> >> >>>> >>> > >>> >> >>>> >>> On Sun, Mar 13, 2016 at 3:43 AM, Ralf Gommers > >>> >> >>>> >>> > >>> >> >>>> >>> wrote: > >>> >> >>>> >>> > > >>> >> >>>> >>> > > >>> >> >>>> >>> > On Sat, Mar 12, 2016 at 6:16 AM, Matthew Brett > >>> >> >>>> >>> > > >>> >> >>>> >>> > wrote: > >>> >> >>>> >>> >> > >>> >> >>>> >>> >> Hi, > >>> >> >>>> >>> >> > >>> >> >>>> >>> >> I just noticed that our (neuroimaging) test suite fails > >>> >> >>>> >>> >> for > >>> >> >>>> >>> >> scipy > >>> >> >>>> >>> >> 0.17.0: > >>> >> >>>> >>> >> > >>> >> >>>> >>> >> https://travis-ci.org/matthew-brett/nipy/jobs/115471563 > >>> >> >>>> >>> >> > >>> >> >>>> >>> >> This turns out to be due to small changes in the result > of > >>> >> >>>> >>> >> `scipy.linalg.lstsq` between 0.16.1 and 0.17.0 - on the > >>> >> >>>> >>> >> same > >>> >> >>>> >>> >> platform > >>> >> >>>> >>> >> - the travis Ubuntu in this case. > >>> >> >>>> >>> >> > >>> >> >>>> >>> >> Is that expected? I haven't investigated deeply, but I > >>> >> >>>> >>> >> believe > >>> >> >>>> >>> >> the > >>> >> >>>> >>> >> 0.16.1 result (back into the depths of time) is slightly > >>> >> >>>> >>> >> more > >>> >> >>>> >>> >> accurate. Is that possible? > >>> >> >>>> >>> > > >>> >> >>>> >>> > > >>> >> >>>> >>> > That's very well possible, because we changed the > >>> >> >>>> >>> > underlying > >>> >> >>>> >>> > LAPACK > >>> >> >>>> >>> > routine > >>> >> >>>> >>> > that's used: > >>> >> >>>> >>> > > >>> >> >>>> >>> > > >>> >> >>>> >>> > > >>> >> >>>> >>> > > >>> >> >>>> >>> > > https://github.com/scipy/scipy/blob/master/doc/release/0.17.0-notes.rst#scipy-linalg-improvements > . > >>> >> >>>> >>> > This PR also had some lstsq precision issues, so could be > >>> >> >>>> >>> > useful: > >>> >> >>>> >>> > https://github.com/scipy/scipy/pull/5550 > >>> >> >>>> >>> > > >>> >> >>>> >>> > What is the accuracy difference in your case? > >>> >> >>>> >>> > >>> >> >>>> >>> Damn - I thought you might ask me that! > >>> >> >>>> >>> > >>> >> >>>> >>> The test where it showed up comes from some calculations > on a > >>> >> >>>> >>> simple > >>> >> >>>> >>> regression problem. 
> >>> >> >>>> >>> > >>> >> >>>> >>> The regression model is $y = X B + e$, with ordinary least > >>> >> >>>> >>> squares > >>> >> >>>> >>> fit > >>> >> >>>> >>> given by $B = X^+ y$ where $X^+$ is the pseudoinverse of > >>> >> >>>> >>> matrix > >>> >> >>>> >>> X. > >>> >> >>>> >>> > >>> >> >>>> >>> For my particular X, $\Sigma e^2$ (SSE) is smaller using > >>> >> >>>> >>> scipy > >>> >> >>>> >>> 0.16 > >>> >> >>>> >>> pinv, compared to scipy 0.17 pinv: > >>> >> >>>> >>> > >>> >> >>>> >>> # SSE using high-precision sympy pinv > >>> >> >>>> >>> 72.0232781366615313 > >>> >> >>>> >>> # SSE using scipy 0.16 pinv > >>> >> >>>> >>> 72.0232781366625261 > >>> >> >>>> >>> # SSE using scipy 0.17 pinv > >>> >> >>>> >>> 72.0232781390480170 > >>> >> >>>> >> > >>> >> >>>> >> > >>> >> >>>> >> That difference looks pretty big, looks almost like a cutoff > >>> >> >>>> >> effect. > >>> >> >>>> >> What > >>> >> >>>> >> is the matrix X, in particular, what is its condition number > >>> >> >>>> >> or > >>> >> >>>> >> singular > >>> >> >>>> >> values? Might also be interesting to see residuals for the > >>> >> >>>> >> different > >>> >> >>>> >> functions. > >>> >> >>>> >> > >>> >> >>>> >> > >>> >> >>>> >> > >>> >> >>>> > > >>> >> >>>> > Note that this might be an indication that the result of the > >>> >> >>>> > calculation has > >>> >> >>>> > an intrinsically large error bar. > >>> >> >>>> > >>> >> >>>> In [6]: np.linalg.svd(X, compute_uv=False) > >>> >> >>>> Out[6]: > >>> >> >>>> array([ 6.97301629e+07, 5.78702921e+05, 5.39646908e+05, > >>> >> >>>> 2.99383626e+05, 3.24329208e+04, 1.29520746e+02, > >>> >> >>>> 3.16197688e+01, 2.66011711e+00, 2.18287258e-01, > >>> >> >>>> 1.60105276e-01, 1.11807228e-01, 2.12694535e-03, > >>> >> >>>> 1.72675581e-03, 9.51576795e-04]) > >>> >> >>>> > >>> >> >>>> In [7]: np.linalg.cond(X) > >>> >> >>>> Out[7]: 73278544935.300507 > >>> >> >>> > >>> >> >>> > >>> >> >>> Condition number is about 7e10, so you might expect to lose > about > >>> >> >>> 10 > >>> >> >>> digits of of accuracy just due to roundoff. The covariance of > the > >>> >> >>> fitting > >>> >> >>> error will be `inv(X.T, X) * error`, so you can look at the > >>> >> >>> diagonals > >>> >> >>> for > >>> >> >>> the variance of the fitted parameters just for kicks, taking > >>> >> >>> `error` > >>> >> >>> to be > >>> >> >>> roundoff error or measurement error. This is not to say that the > >>> >> >>> results > >>> >> >>> should not be more reproducible, but should be of interest for > the > >>> >> >>> validity > >>> >> >>> of the solution. Note that I suspect that there may be a > >>> >> >>> difference > >>> >> >>> in the > >>> >> >>> default `rcond` used, or maybe the matrix is on the edge and a > >>> >> >>> small > >>> >> >>> change > >>> >> >>> in results changes the effective rank. You can run with the > >>> >> >>> `return_rank=True` argument to check that. Looks like machine > >>> >> >>> precision is > >>> >> >>> used for the default value of rcond, so experimenting with that > >>> >> >>> probable > >>> >> >>> won't be informative, but you might want to try it anyway. > >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> statsmodels is using numpy pinv for the linear model and it works > >>> >> >> in > >>> >> >> almost all cases very well *). > >>> >> >> > >>> >> >> Using pinv with singular (given the rcond threshold) X makes > >>> >> >> everything > >>> >> >> well defined and not subject to numerical noise because of the > >>> >> >> regularization induced by pinv. 
Our covariance for the parameter > >>> >> >> estimates, > >>> >> >> based on pinv(X), is also not noisy but it's of reduced rank. > >>> >> > > >>> >> > > >>> >> > What I'm thinking is that changing the algorithm, as here, is like > >>> >> > introducing noise relative to the original, so even if the result > is > >>> >> > perfectly valid, the fitted parameters might change > significantly. A > >>> >> > change > >>> >> > in the 11-th digit wouldn't surprise me at all, depending on the > >>> >> > actual > >>> >> > X > >>> >> > and rhs. > >>> >> > >>> >> I believe the errors $e = y - X^+ y$ are in general robust to > >>> >> differences between valid pinv solutions, even in extreme cases: > >>> >> > >>> >> In [2]: print_scalar(sse(X, npl.pinv(X), y)) > >>> >> 72.0232781366616024 > >>> >> > >>> >> In [3]: print_scalar(sse(np.c_[X, X], npl.pinv(np.c_[X, X]), y)) > >>> >> 72.0232781366939463 > >>> >> > >>> >> In [4]: npl.cond(np.c_[X, X]) > >>> >> Out[4]: 6.6247965335098062e+26 > >>> > > >>> > > >>> > > >>> > I think this does not depend on the smallest eigenvalues which should > >>> > be > >>> > regularized away in cases like this. > >>> > The potential problem is with eigenvalues that are just a but too > large > >>> > for > >>> > rcond to remove them but still have numerical problems. > >>> > > >>> > I think it's a similar issue about the threshold as the one that you > >>> > raised > >>> > some time ago with determining the matrix rank for near singular > >>> > matrices. > >>> > >>> My point was (I think you're agreeing) that the accuracy of the pinv > >>> does not depend in a simple way on the condition number of the input > >>> matrix. > >> > >> > >> That's not the point I wanted to make. > >> The condition number doesn't matter if there is an singular/eigenvalue > >> with rcond 1e-16 or 1e-30 or 1e-100. > >> However, the numerical accuracy might depend on the condition number > when > >> there is a singular value with rcond of 2e-15 or 5e-15. > >> I didn't manage to find the default cond or rcond for scipy.linalg.pinv > >> and lstsq. The problem I had was with numpy's pinv. > >> I think there is a grey zone in rcond of eigenvalues that might be bad > >> numerically. > >> > >> (My only experience is with SVD based pinv.) > > > > > > In the singular values array that you showed earlier, you have three > small > > eigenvalues, but their rcond are only around 1e-11. > > So from my experience with SVD pinv, there shouldn't be any problem with > a > > pinv in this case. > > So - is there any downside to recommending numpy.linalg.pinv over > scipy.linalg.pinv in general? > The only time that I did a comparison was several years ago with the ill conditioned NIST test case. Some of the scipy pinv where a tiny bit more accurate than the numpy pinv. Smallest singular or eigenvalues were zero +/- noise. My impression was that in edge cases where numerical noise matters, some implementation details or which linalg library is used can have a visible effect. (I still have never read a numerical linear algebra book, and my "incompetence" from a thread many years ago is only tempered by 7 more years of usage.) I'm pretty happy with using np.linalg.pinv, although there are also a few cases of noise results. The main reason I stayed with numpy's pinv for statsmodels was that it was quite a bit faster than scipy pinv, partially because it doesn't do the isfinite check and maybe for some other algorithmic reasons. I haven't compared the two since then, and scipy's pinv has changed since then. 
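If someone wants to redo that comparison, a rough sketch would be
something like the following (X is only a placeholder for an
ill-conditioned design matrix; the exact differences and the timings
will depend on which LAPACK numpy and scipy are linked against):

import numpy as np
import numpy.linalg as npl
import scipy.linalg as spl

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 14))
X[:, -1] = X[:, 0] + 1e-9 * rng.normal(size=200)  # nearly collinear column

p_np = npl.pinv(X)     # SVD based
p_sp = spl.pinv(X)     # least-squares (lstsq) based
p_sp2 = spl.pinv2(X)   # SVD based, analogous to numpy's pinv

print(npl.cond(X))
print(np.abs(p_np - p_sp).max())
print(np.abs(p_np - p_sp2).max())

and %timeit npl.pinv(X) versus %timeit spl.pinv(X) in IPython would show
whether the speed difference is still there.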
statsmodels unit test for these kind of good problems are usually in the rtol=1e-12 to 1e-13 range. Going up to 1e-14 usually failed on some machine during release testing. Problems that are not so nice have lower precision. Josef > > Cheers, > > Matthew > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Sat Mar 19 19:27:03 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 19 Mar 2016 17:27:03 -0600 Subject: [SciPy-Dev] Numpy 1.11.0rc2 released. Message-ID: Hi All, Numpy 1.11.0rc2 has been released. There have been a few reversions and fixes since 1.11.0rc1, but by and large things have been pretty stable and unless something terrible happens, I will make the final next weekend. Source files and documentation can be found on Sourceforge , while source files and OS X wheels for Python 2.7, 3.3, 3.4, and 3.5 can be installed from Pypi. Please test and yell loudly if there is a problem. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Mar 21 04:11:19 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 21 Mar 2016 09:11:19 +0100 Subject: [SciPy-Dev] Slight change in lstsq accuracy for 0.17 - expected? In-Reply-To: References: Message-ID: numpy.linalg.pinv uses the same solver as scipy.linalg.pinv2 (SVD) while scipy.linalg.pinv uses the least squares solver. -- Olivier From david.mikolas1 at gmail.com Mon Mar 21 11:54:13 2016 From: david.mikolas1 at gmail.com (David Mikolas) Date: Mon, 21 Mar 2016 23:54:13 +0800 Subject: [SciPy-Dev] minimize reporting successful but not searching (with default method) Message-ID: I have a situation where scipy.optimize.minimize returns the initial starting value as the answer without really iterating, and returns "successful" status, using default method. It works fine if I specify Nelder-Mead, but my question is "why does minimize return successful when it isn's, and what can I do (if anything) to flag this behavior? I originally posted in scipy-user but decided it may be more appropriate here. The question is also here http://stackoverflow.com/q/36110998 and if someone can also leave an answer there as well, that would be great! but not necessary. One particular - I'm using calls to Skyfield, which may or may not have something to do with it. (http://rhodesmill.org/skyfield/) FULL OUTPUT using DEFAULT METHOD: status: 0 success: True njev: 1 nfev: 3 hess_inv: array([[1]]) fun: 1694.98753895812 x: array([ 10000.]) message: 'Optimization terminated successfully.' jac: array([ 0.]) nit: 0 FULL OUTPUT using Nelder-Mead METHOD: status: 0 nfev: 63 success: True fun: 3.2179306044608054 x: array([ 13053.81011963]) message: 'Optimization terminated successfully.' nit: 28 -------------- next part -------------- An HTML attachment was scrubbed... URL: From pav at iki.fi Mon Mar 21 17:39:47 2016 From: pav at iki.fi (Pauli Virtanen) Date: Mon, 21 Mar 2016 21:39:47 +0000 (UTC) Subject: [SciPy-Dev] minimize reporting successful but not searching (with default method) References: Message-ID: Mon, 21 Mar 2016 23:54:13 +0800, David Mikolas kirjoitti: > It works fine if I specify Nelder-Mead, but my question is "why does > minimize return successful when it isn's Your function is piecewise constant ("staircase" pattern on a small scale). 
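A toy function with the same small-scale structure shows the same
behaviour (a hypothetical stand-in, not your actual Skyfield code):

import numpy as np
from scipy.optimize import minimize

STEP = 1e-4  # stair width, chosen larger than the change of the smooth
             # part over the default numerical-differentiation step ~1.5e-8

def f(x):
    smooth = (x[0] - 1.234) ** 2 + 0.05
    return np.floor(smooth / STEP) * STEP  # quantize -> piecewise constant

print(minimize(f, [3.0]))                        # default (BFGS): nit=0, "success"
print(minimize(f, [3.0], method='Nelder-Mead'))  # ends up near x ~ 1.23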
The derivative at the initial point is zero. The optimization method is a local optimizer. Therefore: the initial point is a local optimum (in the sense understood by the optimizer i.e. satisfying the KKT conditions). *** As your function is not differentiable, the KKT conditions don't mean much. Note that it works with Nelder-Mead is sort of a coincidence --- it works because the "staircase" pattern of the function is on a scale smaller than the initial simplex size. On the other hand, it fails for BFGS, because the default step size used for numerical differentiation happens to be smaller than the size of the "staircase" size. > and what can I do (if anything) to flag this behavior? Know whether your function is differentiable or not, and which optimization methods are expected to work. There's no cheap general way of detecting --- based on the numerical floating-point values output by a function --- whether a function is everywhere differentiable or continuous. This information must come from the author of the function, and optimization methods can only assume it. You can however make sanity checks, e.g., using different initial values to see if you're stuck at a local minimum. If the function is not differentiable, optimizers that use derivatives, even numerically approximated, can be unreliable. If the non-differentiability is on a small scale, you can cheat and choose a numerical differentiation step size large enough --- and hope for the best. -- Pauli Virtanen From andyfaff at gmail.com Mon Mar 21 21:00:32 2016 From: andyfaff at gmail.com (Andrew Nelson) Date: Tue, 22 Mar 2016 12:00:32 +1100 Subject: [SciPy-Dev] minimize reporting successful but not searching (with default method) In-Reply-To: References: Message-ID: If you have a rough idea of lower/upper bounds for the parameters then you could use differential_evolution, it uses a stochastic rather than gradient approach. It will require many more function evaluations though. On 22 March 2016 at 08:39, Pauli Virtanen wrote: > Mon, 21 Mar 2016 23:54:13 +0800, David Mikolas kirjoitti: > > It works fine if I specify Nelder-Mead, but my question is "why does > > minimize return successful when it isn's > > Your function is piecewise constant ("staircase" pattern on a small > scale). > > The derivative at the initial point is zero. > > The optimization method is a local optimizer. > > Therefore: the initial point is a local optimum (in the sense understood > by the optimizer i.e. satisfying the KKT conditions). > > *** > > As your function is not differentiable, the KKT conditions don't mean > much. > > Note that it works with Nelder-Mead is sort of a coincidence --- it works > because the "staircase" pattern of the function is on a scale smaller > than the initial simplex size. > > On the other hand, it fails for BFGS, because the default step size used > for numerical differentiation happens to be smaller than the size of the > "staircase" size. > > > and what can I do (if anything) to flag this behavior? > > Know whether your function is differentiable or not, and which > optimization methods are expected to work. > > There's no cheap general way of detecting --- based on the numerical > floating-point values output by a function --- whether a function is > everywhere differentiable or continuous. This information must come from > the author of the function, and optimization methods can only assume it. > > You can however make sanity checks, e.g., using different initial values > to see if you're stuck at a local minimum. 
> > If the function is not differentiable, optimizers that use derivatives, > even numerically approximated, can be unreliable. > > If the non-differentiability is on a small scale, you can cheat and choose > a numerical differentiation step size large enough --- and hope for the > best. > > -- > Pauli Virtanen > > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -- _____________________________________ Dr. Andrew Nelson _____________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.mikolas1 at gmail.com Mon Mar 21 21:46:21 2016 From: david.mikolas1 at gmail.com (David Mikolas) Date: Tue, 22 Mar 2016 09:46:21 +0800 Subject: [SciPy-Dev] minimize reporting successful but not searching (with default method) In-Reply-To: References: Message-ID: Pauli thank you for taking the time to explain clearly and thoroughly. I've updated the question in stackexchange by removing words like "wrong" and "fail", and I've added a supplementary answer with a plot of the staircase behavior of the JulianDate method. http://stackoverflow.com/a/36144582 http://i.stack.imgur.com/CUuia.png That's 40 microseconds on a span of thousands of years. On Tue, Mar 22, 2016 at 5:39 AM, Pauli Virtanen wrote: > Mon, 21 Mar 2016 23:54:13 +0800, David Mikolas kirjoitti: > > It works fine if I specify Nelder-Mead, but my question is "why does > > minimize return successful when it isn's > > Your function is piecewise constant ("staircase" pattern on a small > scale). > > The derivative at the initial point is zero. > > The optimization method is a local optimizer. > > Therefore: the initial point is a local optimum (in the sense understood > by the optimizer i.e. satisfying the KKT conditions). > > *** > > As your function is not differentiable, the KKT conditions don't mean > much. > > Note that it works with Nelder-Mead is sort of a coincidence --- it works > because the "staircase" pattern of the function is on a scale smaller > than the initial simplex size. > > On the other hand, it fails for BFGS, because the default step size used > for numerical differentiation happens to be smaller than the size of the > "staircase" size. > > > and what can I do (if anything) to flag this behavior? > > Know whether your function is differentiable or not, and which > optimization methods are expected to work. > > There's no cheap general way of detecting --- based on the numerical > floating-point values output by a function --- whether a function is > everywhere differentiable or continuous. This information must come from > the author of the function, and optimization methods can only assume it. > > You can however make sanity checks, e.g., using different initial values > to see if you're stuck at a local minimum. > > If the function is not differentiable, optimizers that use derivatives, > even numerically approximated, can be unreliable. > > If the non-differentiability is on a small scale, you can cheat and choose > a numerical differentiation step size large enough --- and hope for the > best. > > -- > Pauli Virtanen > > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From david.mikolas1 at gmail.com Mon Mar 21 21:59:50 2016 From: david.mikolas1 at gmail.com (David Mikolas) Date: Tue, 22 Mar 2016 09:59:50 +0800 Subject: [SciPy-Dev] minimize reporting successful but not searching (with default method) In-Reply-To: References: Message-ID: Andrew, Differential evolution is new to me and therefore by definition interesting, thanks! In this particular case I definitely want the local minimum (this eclipse), and calls are expensive - there's a millisecond delay for each new instance of Julian Date arrays http://stackoverflow.com/q/35358401 But if I access the database directly, then this could be very interesting to try to search for new events! On Tue, Mar 22, 2016 at 9:00 AM, Andrew Nelson wrote: > If you have a rough idea of lower/upper bounds for the parameters then you > could use differential_evolution, it uses a stochastic rather than gradient > approach. It will require many more function evaluations though. > > > On 22 March 2016 at 08:39, Pauli Virtanen wrote: > >> Mon, 21 Mar 2016 23:54:13 +0800, David Mikolas kirjoitti: >> > It works fine if I specify Nelder-Mead, but my question is "why does >> > minimize return successful when it isn's >> >> Your function is piecewise constant ("staircase" pattern on a small >> scale). >> >> The derivative at the initial point is zero. >> >> The optimization method is a local optimizer. >> >> Therefore: the initial point is a local optimum (in the sense understood >> by the optimizer i.e. satisfying the KKT conditions). >> >> *** >> >> As your function is not differentiable, the KKT conditions don't mean >> much. >> >> Note that it works with Nelder-Mead is sort of a coincidence --- it works >> because the "staircase" pattern of the function is on a scale smaller >> than the initial simplex size. >> >> On the other hand, it fails for BFGS, because the default step size used >> for numerical differentiation happens to be smaller than the size of the >> "staircase" size. >> >> > and what can I do (if anything) to flag this behavior? >> >> Know whether your function is differentiable or not, and which >> optimization methods are expected to work. >> >> There's no cheap general way of detecting --- based on the numerical >> floating-point values output by a function --- whether a function is >> everywhere differentiable or continuous. This information must come from >> the author of the function, and optimization methods can only assume it. >> >> You can however make sanity checks, e.g., using different initial values >> to see if you're stuck at a local minimum. >> >> If the function is not differentiable, optimizers that use derivatives, >> even numerically approximated, can be unreliable. >> >> If the non-differentiability is on a small scale, you can cheat and choose >> a numerical differentiation step size large enough --- and hope for the >> best. >> >> -- >> Pauli Virtanen >> >> _______________________________________________ >> SciPy-Dev mailing list >> SciPy-Dev at scipy.org >> https://mail.scipy.org/mailman/listinfo/scipy-dev >> > > > > -- > _____________________________________ > Dr. Andrew Nelson > > > _____________________________________ > > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charlesr.harris at gmail.com Sun Mar 27 19:32:46 2016 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sun, 27 Mar 2016 17:32:46 -0600 Subject: [SciPy-Dev] Numpy 1.11.0 released Message-ID: Hi All, I'm pleased to announce the release of Numpy 1.11.0. This release is the result of six months of work comprising 371 merged pull requests submitted by 112 authors containing many bug fixes and improvements. Highlights are: - The datetime64 type is now timezone naive. - A dtype parameter has been added to ``randint``. - Improved detection of two arrays possibly sharing memory. - Automatic bin size estimation for ``np.histogram``. - Speed optimization of A @ A.T and dot(A, A.T). - New function ``np.moveaxis`` for reordering array axes. Source files and documentation can be found on Sourceforge , while source files and OS X wheels for Python 2.7, 3.3, 3.4, and 3.5 can be installed from Pypi. Note that this is the last release to support Python 2.6, 3.2, and 3.3. Contributors are listed below in alphabetical order with the names of new contributors starred. Abdullah Alrasheed* Aditya Panchal* Alain* Alex Griffing Alex Rogozhnikov* Alex Willmer Alexander Heger* Allan Haldane Anatoly Techtonik* Andrew Nelson Anne Archibald Antoine Pitrou Antony Lee* Behzad Nouri Bertrand Lefebvre Blake Griffith Boxiang Sun* Brigitta Sipocz* Carl Kleffner Charles Harris Chris Hogan* Christoph Gohlke Colin Jermain* Cong Ma* Daniel David Freese David Sanders* Dmitry Odzerikho* Dmitry Zagorny* Eric Larson* Eric Moore Ethan Kruse* Eugene Krokhalev* Evgeni Burovski* Francis T. O'Donovan* Fran?ois Boulogne* Gabi Davar* Gerrit Holl Gopal Singh Meena* Greg Yang* Greg Young* Gregory R. Lee Griffin Hosseinzadeh* Hassan Kibirige* Holger Kohr* Ian Henriksen Iceman9* Jaime Fernandez James Camel* Jason King* John Bjorn Nelson* John Kirkham Jonathan Helmus Jonathan Underwood* Joseph Fox-Rabinovitz* Julian Taylor Julien Dubois* Julien Lhermitte* Julien Schueller* J?rn Hees* Konstantinos Psychas* Lars Buitinck Luke Zoltan Kelley* MaPePeR* Mad Physicist* Mark Wiebe Marten van Kerkwijk Matthew Brett Matthias Geier* Maximilian Trescher* Michael K. Tran* Michael Behrisch* Michael Currie* Michael L?ffler* Nathaniel Hellabyte* Nathaniel J. Smith Nick Papior* Nicolas Calle* Nicol?s Della Penna* Olivier Grisel Pauli Virtanen Peter Iannucci Phaiax* Ralf Gommers Rehas Sachdeva* Ronan Lamy Ruediger Meier* Ryan Grout* Ryosuke Okuta* R?my L?one* Sam Radhakrishnan* Samuel St-Jean* Sebastian Berg Simon Conseil* Stephan Hoyer Stuart Archibald* Stuart Berg Sumith* Tapasweni Pathak* Thomas Robitaille Tobias Megies* Tushar Gautam* Varun Nayyar* Vincent Legoll* Warren Weckesser Wendell Smith Yash Mehrotra* Yifan Li* endolith floatingpointstack* ldoddema* yolanda15* Thanks to all who worked on this release. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From evgeny.burovskiy at gmail.com Sun Mar 27 19:44:19 2016 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Mon, 28 Mar 2016 00:44:19 +0100 Subject: [SciPy-Dev] Numpy 1.11.0 released In-Reply-To: References: Message-ID: Thank you for doing this Chuck. 28.03.2016 2:32 ???????????? "Charles R Harris" ???????: > Hi All, > > I'm pleased to announce the release of Numpy 1.11.0. This release is the > result of six months of work comprising 371 merged pull requests submitted > by 112 authors containing many bug fixes and improvements. Highlights are: > > - The datetime64 type is now timezone naive. 
> - A dtype parameter has been added to ``randint``. > - Improved detection of two arrays possibly sharing memory. > - Automatic bin size estimation for ``np.histogram``. > - Speed optimization of A @ A.T and dot(A, A.T). > - New function ``np.moveaxis`` for reordering array axes. > > Source files and documentation can be found on Sourceforge > , while > source files and OS X wheels for Python 2.7, 3.3, 3.4, and 3.5 can be > installed from Pypi. Note that this is the last release to support Python > 2.6, 3.2, and 3.3. > > Contributors are listed below in alphabetical order with the names of new > contributors starred. > > Abdullah Alrasheed* > Aditya Panchal* > Alain* > Alex Griffing > Alex Rogozhnikov* > Alex Willmer > Alexander Heger* > Allan Haldane > Anatoly Techtonik* > Andrew Nelson > Anne Archibald > Antoine Pitrou > Antony Lee* > Behzad Nouri > Bertrand Lefebvre > Blake Griffith > Boxiang Sun* > Brigitta Sipocz* > Carl Kleffner > Charles Harris > Chris Hogan* > Christoph Gohlke > Colin Jermain* > Cong Ma* > Daniel > David Freese > David Sanders* > Dmitry Odzerikho* > Dmitry Zagorny* > Eric Larson* > Eric Moore > Ethan Kruse* > Eugene Krokhalev* > Evgeni Burovski* > Francis T. O'Donovan* > Fran?ois Boulogne* > Gabi Davar* > Gerrit Holl > Gopal Singh Meena* > Greg Yang* > Greg Young* > Gregory R. Lee > Griffin Hosseinzadeh* > Hassan Kibirige* > Holger Kohr* > Ian Henriksen > Iceman9* > Jaime Fernandez > James Camel* > Jason King* > John Bjorn Nelson* > John Kirkham > Jonathan Helmus > Jonathan Underwood* > Joseph Fox-Rabinovitz* > Julian Taylor > Julien Dubois* > Julien Lhermitte* > Julien Schueller* > J?rn Hees* > Konstantinos Psychas* > Lars Buitinck > Luke Zoltan Kelley* > MaPePeR* > Mad Physicist* > Mark Wiebe > Marten van Kerkwijk > Matthew Brett > Matthias Geier* > Maximilian Trescher* > Michael K. Tran* > Michael Behrisch* > Michael Currie* > Michael L?ffler* > Nathaniel Hellabyte* > Nathaniel J. Smith > Nick Papior* > Nicolas Calle* > Nicol?s Della Penna* > Olivier Grisel > Pauli Virtanen > Peter Iannucci > Phaiax* > Ralf Gommers > Rehas Sachdeva* > Ronan Lamy > Ruediger Meier* > Ryan Grout* > Ryosuke Okuta* > R?my L?one* > Sam Radhakrishnan* > Samuel St-Jean* > Sebastian Berg > Simon Conseil* > Stephan Hoyer > Stuart Archibald* > Stuart Berg > Sumith* > Tapasweni Pathak* > Thomas Robitaille > Tobias Megies* > Tushar Gautam* > Varun Nayyar* > Vincent Legoll* > Warren Weckesser > Wendell Smith > Yash Mehrotra* > Yifan Li* > endolith > floatingpointstack* > ldoddema* > yolanda15* > > Thanks to all who worked on this release. > > Chuck > > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From evgeny.burovskiy at gmail.com Mon Mar 28 06:29:26 2016 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Mon, 28 Mar 2016 11:29:26 +0100 Subject: [SciPy-Dev] RFC: sparse DOK array Message-ID: Hi, TL;DR: Here's a draft implementation of a 'sparse array', suitable for incremental assembly. Comments welcome! https://github.com/ev-br/sparr Scipy sparse matrices are generally great. One area where they are not very great is the formats suitable for incremental assembly (esp LIL and DOK matrices). 
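For concreteness, the assembly pattern in question is roughly the following (a made-up finite-difference-style loop, not code from any existing PR); it is the per-item __setitem__ cost inside this loop that a C-level dictionary-of-keys structure would address:

from scipy import sparse

# Build the matrix one entry at a time, then convert once for fast arithmetic.
n = 1000
A = sparse.dok_matrix((n, n))
for i in range(n - 1):
    A[i, i] += 2.0
    A[i, i + 1] += -1.0
    A[i + 1, i] += -1.0
A[n - 1, n - 1] += 2.0
A = A.tocsr()
print(A.nnz)    # 2998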
This has been discussed for a while, see, for instance, https://github.com/scipy/scipy/pull/2908#discussion_r16955546 I spent some time to see how hard is it to build a dictionary-of-keys data structure without Python overhead. Turns out, not very hard. Also, when constructing a sparse data structure now, it seems to make sense to make it a sparse *array* rather than *matrix*, so that it supports both elementwise and matrix multiplication. A draft implementation is available at https://github.com/ev-br/sparr and I'd be grateful for any comments, large or small. Some usage examples are in the README file which Github shows at the front page above. First and foremost, I'd like to gauge interest in the community ;-). Does it actually make sense? Would you use such a data structure? What is missing in the current version? Short to medium term, some issues I see are: * 32-bit vs 64-bit indices. Scipy sparse matrices switch between index types. I wonder if this is purely backwards compatibility. Naively, it seems to me that a new class could just always use 64-bit indices, but this might be too naive? * Data types and casting rules. For now, I basically piggy-back on numpy's rules. There are several slightly different ones (numba has one?), and there might be an opportunity to simplify the rules. OTOH, inventing one more subtly different set of rules might be a bad idea. * "Object" dtype. So far, there isn't one. I wonder if it's needed or having only numeric types would be enough. * Interoperation with numpy arrays and other sparse matrices. I guess __numpy_ufunc__ would *the* solution here, when available. For now, I do something simple based on special-casing and __array_priority__. Sparse matrices almost work, but there are glitches. This is just a non-exhaustive list of things I'm thinking about. If there's something else (and I'm sure there is plenty!) I'm all ears --- either in this email thread or in the issue tracker. If someone feels like commenting on implementation details, sure, none of those is cast in stone and plumbing is definitely in a bit of a flux. Cheers, Evgeni From shoyer at gmail.com Mon Mar 28 12:13:25 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 28 Mar 2016 09:13:25 -0700 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: On Mon, Mar 28, 2016 at 3:29 AM, Evgeni Burovski wrote: > First and foremost, I'd like to gauge interest in the community ;-). > Does it actually make sense? Would you use such a data structure? What is > missing in the current version? > This looks awesome, and makes complete sense to me! In particular, xarray could really use an n-dimensional sparse structure. A few other things small things I'd like to see: - Support for slicing, even if it's expensive. - A strict way to set the shape without automatic expansion, if desired (e.g., if shape is provided in the constructor). - Default to the dtype of the fill_value. NumPy does this for np.full. > Short to medium term, some issues I see are: > > * 32-bit vs 64-bit indices. Scipy sparse matrices switch between index > types. > I wonder if this is purely backwards compatibility. Naively, it seems to me > that a new class could just always use 64-bit indices, but this might > be too naive? > > * Data types and casting rules. For now, I basically piggy-back on > numpy's rules. > There are several slightly different ones (numba has one?), and there > might be > an opportunity to simplify the rules. 
OTOH, inventing one more subtly > different > set of rules might be a bad idea. > Yes, please follow NumPy. * "Object" dtype. So far, there isn't one. I wonder if it's needed or having > only numeric types would be enough. > This would be marginally useful -- eventually someone is going to want to store some strings in a sparse array, and NumPy doesn't handle this very well. Thus pandas, h5py and xarray all end up using dtype=object for variable length strings. (pandas/xarray even take the monstrous approach of using np.nan as a sentinel missing value.) > * Interoperation with numpy arrays and other sparse matrices. I guess > __numpy_ufunc__ would *the* solution here, when available. > For now, I do something simple based on special-casing and > __array_priority__. > Sparse matrices almost work, but there are glitches. > Yes, __array_priority__ is about the best we can do now. You could actually use a mix of __array_prepare__ and __array_wrap__ to make (non-generalized) ufuncs work, e.g., for functions like np.sin: - In __array_prepare__, return the non-fill values of the array concatenated with the fill value. - In __array_wrap__, reshape all but the last element to build a new sparse array, using the last element for the new fill value. This would be a neat trick and get you most of what you could hope for from __numpy_ufunc__. -------------- next part -------------- An HTML attachment was scrubbed... URL: From evgeny.burovskiy at gmail.com Mon Mar 28 12:19:12 2016 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Mon, 28 Mar 2016 17:19:12 +0100 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: For completeness I'd mention an alternative: https://github.com/scipy/scipy/pull/6004 This is a completely independent effort framed better for the scipy.sparse framework. On Mon, Mar 28, 2016 at 11:29 AM, Evgeni Burovski wrote: > Hi, > > TL;DR: Here's a draft implementation of a 'sparse array', suitable for > incremental > assembly. Comments welcome! > https://github.com/ev-br/sparr > > Scipy sparse matrices are generally great. One area where they are not > very great > is the formats suitable for incremental assembly (esp LIL and DOK > matrices). This > has been discussed for a while, see, for instance, > https://github.com/scipy/scipy/pull/2908#discussion_r16955546 > > I spent some time to see how hard is it to build a dictionary-of-keys > data structure without Python overhead. Turns out, not very hard. > > Also, when constructing a sparse data structure now, it seems to make sense > to make it a sparse *array* rather than *matrix*, so that it supports both > elementwise and matrix multiplication. > > A draft implementation is available at https://github.com/ev-br/sparr and > I'd be grateful for any comments, large or small. > > Some usage examples are in the README file which Github shows at the > front page above. > > First and foremost, I'd like to gauge interest in the community ;-). > Does it actually make sense? Would you use such a data structure? What is > missing in the current version? > > Short to medium term, some issues I see are: > > * 32-bit vs 64-bit indices. Scipy sparse matrices switch between index types. > I wonder if this is purely backwards compatibility. Naively, it seems to me > that a new class could just always use 64-bit indices, but this might > be too naive? > > * Data types and casting rules. For now, I basically piggy-back on > numpy's rules. 
> There are several slightly different ones (numba has one?), and there might be > an opportunity to simplify the rules. OTOH, inventing one more subtly different > set of rules might be a bad idea. > > * "Object" dtype. So far, there isn't one. I wonder if it's needed or having > only numeric types would be enough. > > * Interoperation with numpy arrays and other sparse matrices. I guess > __numpy_ufunc__ would *the* solution here, when available. > For now, I do something simple based on special-casing and __array_priority__. > Sparse matrices almost work, but there are glitches. > > This is just a non-exhaustive list of things I'm thinking about. If there's > something else (and I'm sure there is plenty!) I'm all ears --- either > in this email thread or in the issue tracker. > If someone feels like commenting on implementation details, sure, none of those > is cast in stone and plumbing is definitely in a bit of a flux. > > Cheers, > > Evgeni From perimosocordiae at gmail.com Mon Mar 28 12:37:26 2016 From: perimosocordiae at gmail.com (CJ Carey) Date: Mon, 28 Mar 2016 11:37:26 -0500 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: Another alternative: https://github.com/perimosocordiae/sparray I tried to follow the numpy API as closely as I could, to see if something like this could work. The code isn't particularly optimized, but it does beat CSR/CSC spmatrix formats on a few benchmarks. My plan going forward is to have swappable (opaque) backends to allow for tradeoffs in speed/memory for access/construction/math operations. Apologies for the lacking documentation! The only good listing of current capabilities at the moment is the test suite ( https://github.com/perimosocordiae/sparray/tree/master/sparray/tests). On Mon, Mar 28, 2016 at 11:19 AM, Evgeni Burovski < evgeny.burovskiy at gmail.com> wrote: > For completeness I'd mention an alternative: > https://github.com/scipy/scipy/pull/6004 > This is a completely independent effort framed better for the > scipy.sparse framework. > > > On Mon, Mar 28, 2016 at 11:29 AM, Evgeni Burovski > wrote: > > Hi, > > > > TL;DR: Here's a draft implementation of a 'sparse array', suitable for > > incremental > > assembly. Comments welcome! > > https://github.com/ev-br/sparr > > > > Scipy sparse matrices are generally great. One area where they are not > > very great > > is the formats suitable for incremental assembly (esp LIL and DOK > > matrices). This > > has been discussed for a while, see, for instance, > > https://github.com/scipy/scipy/pull/2908#discussion_r16955546 > > > > I spent some time to see how hard is it to build a dictionary-of-keys > > data structure without Python overhead. Turns out, not very hard. > > > > Also, when constructing a sparse data structure now, it seems to make > sense > > to make it a sparse *array* rather than *matrix*, so that it supports > both > > elementwise and matrix multiplication. > > > > A draft implementation is available at https://github.com/ev-br/sparr > and > > I'd be grateful for any comments, large or small. > > > > Some usage examples are in the README file which Github shows at the > > front page above. > > > > First and foremost, I'd like to gauge interest in the community ;-). > > Does it actually make sense? Would you use such a data structure? What is > > missing in the current version? > > > > Short to medium term, some issues I see are: > > > > * 32-bit vs 64-bit indices. Scipy sparse matrices switch between index > types. 
> > I wonder if this is purely backwards compatibility. Naively, it seems to > me > > that a new class could just always use 64-bit indices, but this might > > be too naive? > > > > * Data types and casting rules. For now, I basically piggy-back on > > numpy's rules. > > There are several slightly different ones (numba has one?), and there > might be > > an opportunity to simplify the rules. OTOH, inventing one more subtly > different > > set of rules might be a bad idea. > > > > * "Object" dtype. So far, there isn't one. I wonder if it's needed or > having > > only numeric types would be enough. > > > > * Interoperation with numpy arrays and other sparse matrices. I guess > > __numpy_ufunc__ would *the* solution here, when available. > > For now, I do something simple based on special-casing and > __array_priority__. > > Sparse matrices almost work, but there are glitches. > > > > This is just a non-exhaustive list of things I'm thinking about. If > there's > > something else (and I'm sure there is plenty!) I'm all ears --- either > > in this email thread or in the issue tracker. > > If someone feels like commenting on implementation details, sure, none > of those > > is cast in stone and plumbing is definitely in a bit of a flux. > > > > Cheers, > > > > Evgeni > _______________________________________________ > SciPy-Dev mailing list > SciPy-Dev at scipy.org > https://mail.scipy.org/mailman/listinfo/scipy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From evgeny.burovskiy at gmail.com Mon Mar 28 17:37:51 2016 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Mon, 28 Mar 2016 22:37:51 +0100 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: Thanks Stephan, > A few other things small things I'd like to see: > - Support for slicing, even if it's expensive. Slicing is on the TODO list. Only needs a bit of plumbing work. > - A strict way to set the shape without automatic expansion, if desired > (e.g., if shape is provided in the constructor). You can set the initial shape in the constructor. Or you mean a flag to freeze the shape once it's set and have `__setitem__(out-of-bounds)` raise an error? > - Default to the dtype of the fill_value. NumPy does this for np.full. Thanks for the suggestion --- done and implemented! >> * Data types and casting rules. For now, I basically piggy-back on >> numpy's rules. >> There are several slightly different ones (numba has one?), and there >> might be >> an opportunity to simplify the rules. OTOH, inventing one more subtly >> different >> set of rules might be a bad idea. > > > Yes, please follow NumPy. One thing I'm wondering is the numpy rule that scalars never upcast arrays. Is it something people actually rely on? [In my experience, I only had to work around it, but my experience might be singular.] > You could actually use a mix of __array_prepare__ and __array_wrap__ to make > (non-generalized) ufuncs work, e.g., for functions like np.sin: > > - In __array_prepare__, return the non-fill values of the array concatenated > with the fill value. > - In __array_wrap__, reshape all but the last element to build a new sparse > array, using the last element for the new fill value. > > This would be a neat trick and get you most of what you could hope for from > __numpy_ufunc__. This is really neat indeed! 
I've flagged it in https://github.com/ev-br/sparr/issues/35 At the moment, I'm dealing with something much less cool, which I suspect I'm not the first one: given m a MapArray and csr a scipy.sparse matrix, - m * csr produces a MapArray holding the result of the elementwise multiplication, but - csr * m fails with the dimension mismatch error when dimensions are OK for elementwise multiply but not matrix multiply. The failure is somewhere in the scipy.sparse code. I tried playing with __array_priority__, but so far I did not manage to convince scipy.sparse matrices to defer cleanly to the right-hand multiplier (left-hand multiplier is OK). Cheers, Evgeni From evgeny.burovskiy at gmail.com Mon Mar 28 17:51:04 2016 From: evgeny.burovskiy at gmail.com (Evgeni Burovski) Date: Mon, 28 Mar 2016 22:51:04 +0100 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: On Mon, Mar 28, 2016 at 5:37 PM, CJ Carey wrote: > Another alternative: https://github.com/perimosocordiae/sparray > > I tried to follow the numpy API as closely as I could, to see if something > like this could work. The code isn't particularly optimized, but it does > beat CSR/CSC spmatrix formats on a few benchmarks. My plan going forward is > to have swappable (opaque) backends to allow for tradeoffs in speed/memory > for access/construction/math operations. > > Apologies for the lacking documentation! The only good listing of current > capabilities at the moment is the test suite > (https://github.com/perimosocordiae/sparray/tree/master/sparray/tests). Wow, a great one from the Happy Valley! I've quite a bit of catching up to do --- you are miles ahead of me in terms of feature completeness. One difference I see is that you seem to be using a flat index. This likely does not play too well with assembly, which is what I tried to take into account. Would you be interested to join forces? Cheers, Evgeni From perimosocordiae at gmail.com Tue Mar 29 12:50:08 2016 From: perimosocordiae at gmail.com (CJ Carey) Date: Tue, 29 Mar 2016 11:50:08 -0500 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: > > > One difference I see is that you seem to be using a flat index. This > likely does not play too well with assembly, which is what I tried to > take into account. > > Yeah, I took the easy route of using a single, sorted, flat index. This makes a lot of operations easy, especially with the help of np.ravel_multi_index and friends. You're correct that it makes assembly and other modifications to the sparsity structure difficult, though! For example: https://github.com/perimosocordiae/sparray/blob/master/sparray/base.py#L335 > Would you be interested to join forces? > I've been meaning to add multiple backends to work around that problem, and I expect that merging our efforts will make that possible. Let me know how you want to proceed. -CJ -------------- next part -------------- An HTML attachment was scrubbed... URL: From travis at continuum.io Tue Mar 29 16:28:58 2016 From: travis at continuum.io (Travis Oliphant) Date: Tue, 29 Mar 2016 13:28:58 -0700 Subject: [SciPy-Dev] Blog-post that explains what Blaze actually is and where Pluribus project now lives. Message-ID: I have emailed this list in the past explaining what is driving my open source efforts now. Here is a blog-post that may help some of you understand at little bit of the history of Blaze, DyND Numba, and other related developments as they relate to scaling up and scaling out array-computing in Python. 
http://technicaldiscovery.blogspot.com/2016/03/anaconda-and-hadoop-story-of-journey.html This post and these projects do not have anything to do with the future of the NumPy and/or SciPy projects which are now in great hands guiding their community-driven development. The post is however, a discussion of additional projects that will hopefully benefit some of you as well, and for which your feedback and assistance is welcome. Best, -Travis -- *Travis Oliphant, PhD* *Co-founder and CEO* @teoliphant 512-222-5440 http://www.continuum.io -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Wed Mar 30 01:28:55 2016 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 29 Mar 2016 22:28:55 -0700 Subject: [SciPy-Dev] RFC: sparse DOK array In-Reply-To: References: Message-ID: On Mon, Mar 28, 2016 at 3:29 AM, Evgeni Burovski wrote: > Hi, > > TL;DR: Here's a draft implementation of a 'sparse array', suitable for > incremental > assembly. Comments welcome! > https://github.com/ev-br/sparr > > Scipy sparse matrices are generally great. One area where they are not > very great > is the formats suitable for incremental assembly (esp LIL and DOK > matrices). This > has been discussed for a while, see, for instance, > https://github.com/scipy/scipy/pull/2908#discussion_r16955546 > > I spent some time to see how hard is it to build a dictionary-of-keys > data structure without Python overhead. Turns out, not very hard. > > Also, when constructing a sparse data structure now, it seems to make sense > to make it a sparse *array* rather than *matrix*, so that it supports both > elementwise and matrix multiplication. > > A draft implementation is available at https://github.com/ev-br/sparr and > I'd be grateful for any comments, large or small. > > Some usage examples are in the README file which Github shows at the > front page above. > > First and foremost, I'd like to gauge interest in the community ;-). > Does it actually make sense? Would you use such a data structure? What is > missing in the current version? This is really cool! The continuing importance of scipy.sparse is basically the only reason np.matrix hasn't been deprecated yet. I've noodled around a little with designs for similar things at various points, so a few general comments based on that that you can take or leave :-) - For general operations on n-dimensional sparse arrays, COO format seems like the obvious thing to use. The strategy I've envisioned is storing the data as N arrays of indices + 1 array of values (or possibly you'd want to pack these together into a single array of (struct containing N+1 values)), plus a bit of metadata indicating the current sort order -- so sort_order=(0, 1) would mean that the items are sorted lexicographically by row and then by column (so like CSC), and sort_order=(1, 0) would mean the opposite. Then operations like sum(axis=1) would be "sort by axis 1 (if not already there); then sum over each contiguous chunk of the data"; sum(axis=(0, 2)) would be "sort to make sure axes 0 and 2 are in the front (but I don't care in which order)", etc. It may be useful to use a stable sort, so that if you have sort_order=(0, 1, 2) and then sort by axis 2, you get sort_order=(2, 0, 1). For binary operations, you ensure that the two arrays have the same sort order (possibly re-sorting the smaller one), and then do a parallel iteration over both of them finding matching indices. 
- Instead of proliferating data types for all kinds of different storage formats, I'd really prefer to see a single data type that "just works". So for example, instead of having one storage format that's optimized for building matrices and one that's optimized for operations on matrices, you could have a single object which has: - COO format representation - a DOK of "pending updates" and writes would go into the DOK part, and at the beginning of each non-trivial operation you'd call self._get_coo_arrays() or whatever, which would do if self._pending_updates: # do a single batch reallocation and incorporate all the pending updates return self._real_coo_arrays There are a lot of potentially viable variants on this using different data structures for the various pieces, different detailed reallocation strategies, etc., but you get the idea. For interface with external C/C++/Fortran code that wants CSR or CSC, I'd just have get_csr() and get_csc() methods that return a tuple of (indptr, indices, data), and not even try to implement any actual operations on this representation. > * Data types and casting rules. For now, I basically piggy-back on > numpy's rules. > There are several slightly different ones (numba has one?), and there might be > an opportunity to simplify the rules. OTOH, inventing one more subtly different > set of rules might be a bad idea. It would be really nice if you could just use numpy's own dtype objects and ufuncs instead of implementing your own. I don't know how viable that is right now, but I think it's worth trying, and if you can't then I'd like to know what part broke :-). Happy to help if you need some tips / further explanation. -n -- Nathaniel J. Smith -- https://vorpus.org From cgodshall at enthought.com Tue Mar 29 11:13:34 2016 From: cgodshall at enthought.com (Courtenay Godshall (Enthought)) Date: Tue, 29 Mar 2016 15:13:34 -0000 Subject: [SciPy-Dev] ANN: LAST CALL for SciPy (Scientific Python) 2016 Conference Talk / Poster Proposals - Due Friday 4/1 Message-ID: <008801d189cd$9108d070$b31a7150$@enthought.com> **ANN: LAST CALL for SciPy 2016 Conference Talk / Poster Proposals (Scientific Computing with Python) - Final Deadline - Friday April 1st** SciPy 2016, the 15th annual Scientific Computing with Python conference, will be held July 11-17, 2016 in Austin, Texas. SciPy is a community dedicated to the advancement of scientific computing through open source Python software for mathematics, science, and engineering. The annual SciPy Conference brings together over 650 participants from industry, academia, and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development. The full program will consist of 2 days of tutorials (July 11-12), 3 days of talks (July 13-15), and 2 days of developer sprints (July 16-17). More info is available on the conference website at http://scipy2016.scipy.org (where you can sign up for the mailing list); or follow @scipyconf on Twitter. ---------------------------------------------------------------------------- --------------------------------- **SUBMIT A SCIPY 2016 TALK / POSTER PROPOSAL - DUE APRIL 1, 2016* ---------------------------------------------------------------------------- --------------------------------- Submissions for talks and posters are welcome on our website ( http://scipy2016.scipy.org). In your abstract, please provide details on what Python tools are being employed, and how. 
The talk and poster submission deadline is March 25th, 2016 SciPy 2016 will include 3 major topic tracks and 8 mini-symposia tracks. Major topic tracks include: - Scientific Computing in Python - Python in Data Science (Big data and not so big data) - High Performance Computing Mini-symposia will include the applications of Python in: - Earth and Space Science - Engineering - Medicine and Biology - Social Sciences - Special Purpose Databases - Case Studies in Industry - Education - Reproducibility If you have any questions or comments, feel free to contact us at: scipy-organizers at scipy.org ----------------------------------------------------------- **SCIPY 2016 REGISTRATION IS OPEN** ----------------------------------------------------------- Please register early. SciPy early bird registration until May 22, 2016. Register at http://scipy2016.scipy.org. Plus, enter our t-shirt design contest to win a free registration. (Send a vector art file to scipy at enthought.com by March 31 to enter). Important dates: April 1: Talk and Poster Proposals Due May 11: Plotting Contest Submissions Due Apr 22: Tutorials Announced Apr 22: Financial Aid Submissions Due May 4: Talk and Posters Announced May 22: Early Bird Registration Deadline Jul 11-12: SciPy 2016 Tutorials Jul 13-15: SciPy 2016 General Conference Jul 16-17: SciPy 2016 Sprints We look forward to an exciting conference and hope to see you in Austin in July! The Scipy 2016 Committee http://scipy2016.scipy.org/ -------------- next part -------------- An HTML attachment was scrubbed... URL: