[Numpy-discussion] lstsq functionality

Skipper Seabold jsseabold at gmail.com
Tue Jul 20 12:03:24 EDT 2010


On Tue, Jul 20, 2010 at 11:23 AM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
> On Tue, Jul 20, 2010 at 8:32 AM, Skipper Seabold <jsseabold at gmail.com>
> wrote:
>>
>> On Tue, Jul 20, 2010 at 1:08 AM, Charles R Harris
>> <charlesr.harris at gmail.com> wrote:
>> >
>> >
>> > On Mon, Jul 19, 2010 at 9:40 PM, Keith Goodman <kwgoodman at gmail.com>
>> > wrote:
>> >>
>> >> On Mon, Jul 19, 2010 at 8:27 PM, Charles R Harris
>> >> <charlesr.harris at gmail.com> wrote:
>> >> >
>> >> >
>> >> > On Mon, Jul 19, 2010 at 9:02 PM, Keith Goodman <kwgoodman at gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> On Mon, Jul 19, 2010 at 6:53 PM, Joshua Holbrook
>> >> >> <josh.holbrook at gmail.com> wrote:
>> >> >> > On Mon, Jul 19, 2010 at 5:50 PM, Charles R Harris
>> >> >> > <charlesr.harris at gmail.com> wrote:
>> >> >> >> Hi All,
>> >> >> >>
>> >> >> >> I'm thinking about adding some functionality to lstsq because I
>> >> >> >> find myself doing the same fixes over and over. List follows.
>> >> >> >>
>> >> >> >> Add weights so data points can be weighted.
>> >> >> >> Use column scaling so condition numbers make more sense.
>> >> >> >> Compute covariance approximation?
>> >> >> >>
>> >> >> >> Unfortunately, the last will require using svd since there are no
>> >> >> >> linear least squares routines in LAPACK that also compute the
>> >> >> >> covariance, at least none that google knows about.
>> >> >> >>
>> >> >> >> Thoughts?
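A minimal sketch of the weighting idea above, for concreteness -- it just
reduces weighted least squares to the ordinary problem by scaling each row
of the design matrix and rhs by the square root of its weight (1-d rhs
assumed):

    import numpy as np

    def wlstsq(a, b, weights):
        # minimize || sqrt(W) (a x - b) ||^2 via ordinary lstsq
        w = np.sqrt(np.asarray(weights, dtype=float))
        return np.linalg.lstsq(a * w[:, np.newaxis], b * w)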
>> >> >> >
>> >> >> > Maybe make 2 functions--one which implements 1 and 2, and another
>> >> >> > which implements 3? I think weights is an excellent idea!
>> >> >>
>> >> >> I'd like an lstsq that did less, like not calculate the sum of
>> >> >> squared residuals. That's useful in tight loops. So I also think
>> >> >> having two lstsqs makes sense. One as basic as possible; one with
>> >> >> bells. How does scipy's lstsq fit into all this?
>> >> >
>> >> > I think the computation of the residues is cheap in lstsq. The
>> >> > algorithm used starts by reducing the design matrix to bidiagonal
>> >> > form and reduces the rhs at the same time. In other words an mxn
>> >> > problem becomes an (n+1)xn problem. That's why the summed square of
>> >> > residuals is available but not the individual residuals; after the
>> >> > reduction there is only one residual and its square is the residue.
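For concreteness, the residue lstsq returns for an overdetermined
full-rank problem is the same number you would get by summing the squared
residuals by hand:

    import numpy as np

    a = np.random.rand(1000, 10)
    b = np.random.rand(1000)
    x, resids, rank, s = np.linalg.lstsq(a, b)
    # resids[0] should match the explicit sum of squared residuals
    r = b - np.dot(a, x)
    print(resids[0], np.dot(r, r))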
>> >>
>> >> That does sound good. But it must take some time. There's indexing,
>> >> array creation, an if statement, summing:
>> >>
>> >>        if results['rank'] == n and m > n:
>> >>            resids = sum((transpose(bstar)[n:,:])**2, axis=0).astype(result_t)
>> >>
>> >> Here are the timings after removing the sum of squared residuals:
>> >>
>> >> >> x = np.random.rand(1000,10)
>> >> >> y = np.random.rand(1000)
>> >> >> timeit np.linalg.lstsq(x, y)
>> >> 1000 loops, best of 3: 369 us per loop
>> >> >> timeit np.linalg.lstsq2(x, y)
>> >> 1000 loops, best of 3: 344 us per loop
>> >> >>
>> >> >> x = np.random.rand(10,2)
>> >> >> y = np.random.rand(10)
>> >> >> timeit np.linalg.lstsq(x, y)
>> >> 10000 loops, best of 3: 102 us per loop
>> >> >> timeit np.linalg.lstsq2(x, y)
>> >> 10000 loops, best of 3: 77 us per loop
>> >
>> > Now that I've looked through the driver program I see that you are
>> > right. Also, all the info needed for the covariance is almost
>> > available in the LAPACK driver program. Hmm, it seems that maybe the
>> > best thing to do here is to pull out the lapack_lite driver program,
>> > modify it, and make it a standalone python module that links to either
>> > the lapack_lite programs or the ATLAS versions. But that is more work
>> > than just doing things in python :-\ It does have the added advantage
>> > that all the work arrays can be handled internally.
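In the meantime, a pure-python sketch of the covariance approximation via
the svd, assuming the usual cov(x) = sigma^2 (A^T A)^-1 estimate with
sigma^2 taken from the residuals:

    import numpy as np

    def lstsq_cov(a, b):
        m, n = a.shape
        u, s, vt = np.linalg.svd(a, full_matrices=False)
        x = np.dot(vt.T, np.dot(u.T, b) / s)      # least squares solution
        r = b - np.dot(a, x)
        sigma2 = np.dot(r, r) / (m - n)           # residual variance
        cov = sigma2 * np.dot(vt.T / s**2, vt)    # sigma^2 * (A^T A)^-1
        return x, cov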
>> >
>>
>> IIUC, I tried this with statsmodels earlier this summer, but found the
>> result to be slower than our current implementation (and surely slower
>> than lstsq since we have a bit more Python overhead around the LAPACK
>> function calls).  I will have to dig through the revisions and make
>> sure about this.
>>
>
> You modified the C version of dgelsd in lapack_lite? That is the driver I'm
> talking about.
>

Ah, then, no, I didn't understand correctly, but I see it in the source now.

>>
>> But for the record, we have weighted least squares (WLS) and you can
>> get the covariance matrix.  I think you can do column-scaling with our
>> GLS (generalized least squares) class.  I haven't seen the terms
>> row-scaling and column-scaling used much in econometrics books; do you
>> have a good reference offhand?
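For reference, a usage sketch of that WLS route -- the import path here is
an assumption (the package layout has moved around), but the idea is:

    import numpy as np
    import statsmodels.api as sm  # path assumed; adjust for your install

    x = sm.add_constant(np.random.rand(100, 3))
    y = np.random.rand(100)
    w = np.ones(100)  # per-observation weights

    res = sm.WLS(y, x, weights=w).fit()
    print(res.params)        # coefficient estimates
    print(res.cov_params())  # covariance matrix of the estimates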
>>
>
> The dgelsd driver scales the matrix, but only enough to keep the numbers
> in the working range. Column scaling affects the singular values, so if
> you are using them to track the condition number of the matrix it can
> affect the outcome. The deeper thing that is going on is that some of
> the usefulness of the svd is based on the assumption that the euclidean
> distance between parameter vectors means something. I just tend to
> normalize the columns to unit length. It is a bit like using z-scores in
> statistics except the column norm is normalized instead of the column
> variance. I'm sure there are references out there but a quick google
> doesn't show anything obvious. Normalization seems to be a bit neglected
> in the usual classroom notes.
>
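A sketch of that normalization, for reference -- scale each column to unit
euclidean length, solve, then undo the scaling on the coefficients:

    import numpy as np

    def lstsq_colscaled(a, b):
        norms = np.sqrt((a * a).sum(axis=0))  # euclidean length of each column
        x, resids, rank, s = np.linalg.lstsq(a / norms, b)
        # s now reflects the scaled problem, so s[0]/s[-1] is a more
        # meaningful condition number; rescale the coefficients back
        return x / norms, resids, rank, s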

That makes sense.  Thanks,

Skipper


