[SciPy-dev] code for incremental least squares

josef.pktd at gmail.com
Tue Feb 16 23:51:40 EST 2010


On Tue, Feb 16, 2010 at 5:56 PM,  <josef.pktd at gmail.com> wrote:
> On Tue, Feb 16, 2010 at 4:56 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> On Tue, Feb 16, 2010 at 12:37 PM,  <josef.pktd at gmail.com> wrote:
>>> 2010/2/16 Stéfan van der Walt <stefan at sun.ac.za>:
>>>> Hi Nathan
>>>>
>>>> On 16 February 2010 22:18, Nathaniel Smith <njs at pobox.com> wrote:
>>>>> I have a need for solving some very large least squares regression
>>>>> problems -- but fortunately my model matrix can be generated
>>>>> incrementally, so there are some tricks to do the important
>>>>> calculations without actually loading the whole thing into memory. The
>>>>> resulting code is rather general, and might be of wider interest, so I
>>>>> thought I'd send a note here and see what people thought about
>>>>> upstreaming it somehow...
>>>>
>>>> This sounds interesting!  Could you expand on the incremental
>>>> generation of the model matrix, and how it is made general?
>>>
>>> I had the same thought. Are you increasing by observation or by
>>> explanatory variable?
>>
>> By observation. (This is good since hopefully you have more
>> observations than explanatory variables!) Actually generating the
>> model matrix in an incremental way is your problem; but as you
>> generate each 'strip' of the model matrix, you just hand it to the
>> code's 'update' method and then forget about it.
>>
>> The basic idea inside 'update' is pretty elementary... if you have a
>> model matrix
>>  X = np.row_stack([X1, X2, X3, ...])
>> then what we need for least squares calculations is
>>  X'X = X1'X1 + X2'X2 + X3'X3 + ...
>> and we can compute this sum incrementally as the X_i matrices arrive.
>> Also, it is obviously easy to parallelize the matrix products (and
>> potentially the generation of the strips themselves, depending on your
>> situation), and those seem to be the bottleneck.
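>>
>> In outline, the accumulator looks something like this (just a sketch;
>> the class and method names here are illustrative, not the actual
>> code):
>>
>>   import numpy as np
>>
>>   class XtXAccumulator(object):
>>       def __init__(self, p):
>>           # running p x p sum of the strips' cross-products
>>           self.xtx = np.zeros((p, p))
>>       def update(self, X_i):
>>           # add this strip's contribution; the strip itself can
>>           # then be discarded
>>           self.xtx += np.dot(X_i.T, X_i)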
>>
>> There's some linear algebra involved in calculating the residual sum
>> of squares and cross-products without actually forming the residuals
>> (you also need to accumulate X'y, handle weights, etc.), but no real
>> magic.
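>>
>> Concretely: accumulate X'y and y'y alongside X'X, and the
>> coefficients and residual sum of squares drop out at the end without
>> ever forming the residuals (again a sketch, with the accumulated
>> arrays named xtx, xty, yty for illustration):
>>
>>   # beta solves the normal equations  X'X beta = X'y, and then
>>   #  RSS = (y - X beta)'(y - X beta)
>>   #      = y'y - 2 beta'X'y + beta'X'X beta = y'y - beta'X'y
>>   beta = np.linalg.solve(xtx, xty)
>>   rss = yty - np.dot(beta, xty)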
>>
>> For the QR code I use a recurrence relation I stole from some slides
>> by Simon Wood to compute R and Q'y incrementally; probably a real
>> incremental QR (e.g., "AS274", which is what R's biglm package uses)
>> would be better, but this one is easy to implement in terms of
>> non-incremental QR.
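>>
>> The recurrence is short: stack the current R on top of the new strip,
>> re-factor, and fold the strip's responses into Q'y with the same
>> orthogonal factor. Roughly (an illustrative sketch, not the attached
>> code; start from R = np.zeros((p, p)) and Qty = np.zeros(p)):
>>
>>   import numpy as np
>>
>>   def qr_update(R, Qty, X_i, y_i):
>>       # reduced QR of the stacked (p + n_i) x p matrix [R; X_i]
>>       Q2, R_new = np.linalg.qr(np.vstack([R, X_i]))
>>       # the same Q2 maps [Q'y; y_i] to the updated Q'y
>>       Qty_new = np.dot(Q2.T, np.concatenate([Qty, y_i]))
>>       return R_new, Qty_new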
>>
>>> In either case, I would be very interested to see this kind of
>>> incremental least squares in statsmodels. If you are able to license
>>> your parts as BSD, then I will definitely look at it.
>>
>> I am.
>>
>>> We have some plans for this, but nothing implemented yet.
>>
>> Oh? Do tell...
>
> pandas has an expanding OLS (besides a moving OLS), which is similar
> in the sense that the OLS estimate is updated recursively for each
> new observation (or group of observations, as it seems to be in your
> case).
>
> However, your code looks more efficient because of the incremental QR
> updating and because you accumulate more summary/sufficient
> statistics, though I only had a brief look and don't remember the
> details of pandas.
>
> My application is also more in time series, where validating the
> estimator requires continuously updating the estimate. In pandas'
> case and in mine, it is computational efficiency rather than memory
> requirements that makes incremental estimation attractive.
> For this part I was looking more into using the Kalman filter, but,
> although I have seen papers that use matrix decompositions for more
> efficient updating, my knowledge of recursive QR, Cholesky, or ?? is
> not great.
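>
> (For reference, the textbook recursive least squares update, which is
> the Kalman filter applied to a regression with constant coefficients;
> this is a sketch of the standard formulas, not anyone's actual code.
> P tracks (X'X)^-1 and is updated via the Sherman-Morrison identity:)
>
>   import numpy as np
>
>   def rls_update(beta, P, x_t, y_t):
>       # gain K = P x / (1 + x'P x), then
>       #  beta <- beta + K (y - x'beta),  P <- P - K (P x)'
>       Px = np.dot(P, x_t)
>       gain = Px / (1.0 + np.dot(x_t, Px))
>       beta = beta + gain * (y_t - np.dot(x_t, beta))
>       P = P - np.outer(gain, Px)
>       return beta, P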
>
> Actually, having more observations than explanatory variables is not
> necessarily the only standard case anymore. In machine learning, and
> in econometrics in a "Data-Rich Environment", the number of
> observations and the number of variables can be of the same order.
> But none of the econometrics applications that I know of would
> struggle with memory problems. There are many interesting cases, for
> example in forecasting, where checking forecast performance requires
> variable selection and re-estimation in each period.
>
> I will have a much closer look, but my impression is that the basic
> structure is close enough to the model structure in statsmodels that
> it would fit in. Also, some of your functions might be useful for
> other models, and maybe for pandas as well. And I will be learning
> more about how to work with QR directly.
>
> Thanks,
>
> Josef
>
>>
>> Anyway, attaching the code so you can see the details for yourselves.
>> It won't quite run as is, since it uses some utility routines I
>> haven't included, but you should be able to get the idea.

just a BTW:

I found this comment in incremental_ls

        # R and Scipy disagree by about 3 orders of magnitude here:
        #  pf(4704.7767809675142416, 4, 12, lower.tail=FALSE) == 4.68e-19
        #  1 - f.cdf(4704.7767809675142416, 4, 12) == 1.11e-16
        # which I guess is most likely Scipy having limited resolution in the
        # tails.


>>> from scipy import stats
>>> stats.f.sf(4704.7767809675142416, 4, 12)
4.6848221938640787e-019
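
The factor-of-1000 disagreement is just catastrophic cancellation in
the lower-tail route: f.cdf here returns a value about one ulp below
1.0, so 1 - f.cdf(...) cannot resolve anything much below
eps/2 ~ 1.1e-16, while sf computes the upper tail directly:

>>> 1 - stats.f.cdf(4704.7767809675142416, 4, 12)
1.1102230246251565e-016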

some of our tails are also pretty good

Josef

>>
>> -- Nathaniel