[SciPy-dev] code for incremental least squares

Nathaniel Smith njs at pobox.com
Tue Feb 16 16:56:39 EST 2010


On Tue, Feb 16, 2010 at 12:37 PM,  <josef.pktd at gmail.com> wrote:
> 2010/2/16 Stéfan van der Walt <stefan at sun.ac.za>:
>> Hi Nathan
>>
>> On 16 February 2010 22:18, Nathaniel Smith <njs at pobox.com> wrote:
>>> I have a need for solving some very large least squares regression
>>> problems -- but fortunately my model matrix can be generated
>>> incrementally, so there are some tricks to do the important
>>> calculations without actually loading the whole thing into memory. The
>>> resulting code is rather general, and might be of wider interest, so I
>>> thought I'd send a note here and see what people thought about
>>> upstreaming it somehow...
>>
>> This sounds interesting!  Could you expand on the incremental
>> generation of the model matrix, and how it is made general?
>
> I have the same thought, Are you increasing by observation or by
> explanatory variable?

By observation. (This is good since hopefully you have more
observations than explanatory variables!) Actually generating the
model matrix in an incremental way is your problem; but as you
generate each 'strip' of the model matrix, you just hand it to the
code's 'update' method and then forget about it.

The basic idea inside 'update' is pretty elementary... if you have a
model matrix
  X = np.row_stack([X1, X2, X3, ...])
then what we need for least squares calculations is
  X'X = X1'X1 + X2'X2 + X3'X3 + ...
and we can compute this sum incrementally as the X_i matrices arrive.
Also, it is obviously easy to parallelize the matrix products (and
potentially the generation of the strips themselves, depending on your
situation), and those seem to be the bottleneck.

There's some linear algebra in working out how to calculate the
residual sum of squares and products without actually calculating the
residuals, you need to also accumulate X'y, weight handling, etc., but
no real magic.

For the QR code I use a recurrence relation I stole from some slides
by Simon Wood to compute R and Q'y incrementally; probably a real
incremental QR (e.g., "AS274", which is what R's biglm package uses)
would be better, but this one is easy to implement in terms of
non-incremental QR.

> For either case, I would be very interested to see this kind of
> incremental least squares in statsmodels. If you are able to license
> your parts as BSD, then I will look at it for sure.

I am.

> We have some plans for this but not yet implemented.

Oh? Do tell...

Anyway, attaching the code so you can see the details for yourselves.
It won't quite run as is, since it uses some utility routines I
haven't included, but you should be able to get the idea.

-- Nathaniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: incremental_qr.py
Type: text/x-python
Size: 4845 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20100216/77969d62/attachment.py>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: incremental_ls.py
Type: text/x-python
Size: 28500 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20100216/77969d62/attachment-0001.py>


More information about the SciPy-Dev mailing list