[SciPy-dev] code for incremental least squares

josef.pktd at gmail.com josef.pktd at gmail.com
Tue Feb 16 17:56:32 EST 2010


On Tue, Feb 16, 2010 at 4:56 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Tue, Feb 16, 2010 at 12:37 PM,  <josef.pktd at gmail.com> wrote:
>> 2010/2/16 Stéfan van der Walt <stefan at sun.ac.za>:
>>> Hi Nathan
>>>
>>> On 16 February 2010 22:18, Nathaniel Smith <njs at pobox.com> wrote:
>>>> I have a need for solving some very large least squares regression
>>>> problems -- but fortunately my model matrix can be generated
>>>> incrementally, so there are some tricks to do the important
>>>> calculations without actually loading the whole thing into memory. The
>>>> resulting code is rather general, and might be of wider interest, so I
>>>> thought I'd send a note here and see what people thought about
>>>> upstreaming it somehow...
>>>
>>> This sounds interesting!  Could you expand on the incremental
>>> generation of the model matrix, and how it is made general?
>>
>> I have the same thought, Are you increasing by observation or by
>> explanatory variable?
>
> By observation. (This is good since hopefully you have more
> observations than explanatory variables!) Actually generating the
> model matrix in an incremental way is your problem; but as you
> generate each 'strip' of the model matrix, you just hand it to the
> code's 'update' method and then forget about it.
>
> The basic idea inside 'update' is pretty elementary... if you have a
> model matrix
>  X = np.row_stack([X1, X2, X3, ...])
> then what we need for least squares calculations is
>  X'X = X1'X1 + X2'X2 + X3'X3 + ...
> and we can compute this sum incrementally as the X_i matrices arrive.
> Also, it is obviously easy to parallelize the matrix products (and
> potentially the generation of the strips themselves, depending on your
> situation), and those seem to be the bottleneck.
>
> There's some linear algebra in working out how to calculate the
> residual sum of squares and products without actually calculating the
> residuals, you need to also accumulate X'y, weight handling, etc., but
> no real magic.
>
> For the QR code I use a recurrence relation I stole from some slides
> by Simon Wood to compute R and Q'y incrementally; probably a real
> incremental QR (e.g., "AS274", which is what R's biglm package uses)
> would be better, but this one is easy to implement in terms of
> non-incremental QR.
>
>> For either case, I would be very interested to see this kind of
>> incremental least squares in statsmodels. If you are able to license
>> your parts as BSD, then I will look at it for sure.
>
> I am.
>
>> We have some plans for this but not yet implemented.
>
> Oh? Do tell...

pandas has expanding ols, (besides moving ols) which is similar in the
idea that, for each new observation (or groups of observations seems
to be in your case), the ols estimate is calculated in a recursive
way.

However your code looks more efficient because of the incremental QR
updating, and that you update more summary/sufficient statistics, but
I just had a brief look and I don't remember the details of pandas.

My application is also the more in time series when validation of the
estimator requires continuous updating of the estimate. In pandas case
and in my case it is more computational efficiency that makes
incremental estimation attractive than memory requirements.
For this part I was looking more into using Kalman Filter, but,
although I have seen papers that use matrix decomposition to do more
efficient updating, my knowledge of recursive QR, or cholesky or ?? is
not great.

Actually, having more observations than explanatory variables is not
necessarily the only standard case anymore. In machine learning and in
econometrics in a "Data-Rich Environment", the number of observations
and variables might be of the same order. But none of the applications
that I know of in econometrics would struggle with memory problems.
There are many interesting cases for example for forecasting where
checking the forecast performance requires variable selection and
re-estimation in each period.

I will have a much closer look, but my impression is that the basic
structure is not so different from the model structure in statsmodels
that it wouldn't fit in. Also, I think that some of your functions
might also be useful for other models, and maybe also for pandas. And
I will be learning more about how to work with QR directly.

Thanks,

Josef

>
> Anyway, attaching the code so you can see the details for yourselves.
> It won't quite run as is, since it uses some utility routines I
> haven't included, but you should be able to get the idea.
>
> -- Nathaniel
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
>



More information about the SciPy-Dev mailing list