[Python-ideas] Fw: Idea: Compressing the stack on the fly

Andrew Barnert abarnert at yahoo.com
Mon May 27 07:23:01 CEST 2013


Sorry, my reply got bounced because it went to the Google Groups address instead of the regular address. Let me try again. For those who don't want to read the whole thing, because others have since made many of the same points: I don't think this is likely to be a good optimization for CPython, much less something to require of all implementations… but it could be an interesting experiment for PyPy, especially when combined with the existing Stackless and JIT features.

From: Ram Rachum <ram.rachum at gmail.com>

Sent: Sunday, May 26, 2013 5:00 AM


> def factorial_recursive(n):
>     if n == 1:
>         return 1
>     return n * factorial_recursive(n - 1)

This is not a tail-recursive implementation. It doesn't just return the 
value of a recursive call, it performs further computation on that value before 
returning it. You need to do something like this:

def factorial_recursive(n, accum=1):
    # Carry the running product in an accumulator so the recursive
    # call is in tail position.
    if n <= 1:  # n <= 1 rather than n == 1, so factorial_recursive(0) terminates
        return accum
    return factorial_recursive(n - 1, n * accum)

(You may notice that the transformation from naive-recursive to tail-recursive 
looks a lot like the transformation to iterative.)
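To make that parallel concrete, here is the corresponding iterative version (my own sketch, not from the original mail). The accumulator argument of the tail-recursive version simply becomes a loop variable:

```python
def factorial_iterative(n):
    # The tail call's accumulator becomes an ordinary loop variable:
    # each "recursive call" is just the next loop iteration.
    accum = 1
    while n > 1:
        accum *= n
        n -= 1
    return accum
```

This is exactly the rewrite a tail-call-optimizing implementation would perform mechanically.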

> But then I thought, maybe you could do something smarter than eliminating 
> individual stack frames. Maybe we could create something that is to the current 
> implementation of the stack what `xrange` is to the old-style `range`. A smart 
> object that allows access to any of a long list of items in it, without actually 
> having to store those items. This would solve the first argument that Guido 
> raises in his post, which I found to be the most substantial one.


One key difference here: xrange has a constant size. No matter how big the range 
is, you only need to store 3 values. Your suggestion of storing each of the 
separate n values means your stack usage is still linear. You might get a 5x, 10x, or 
even 50x improvement out of it, but it would still grow linearly. And there 
aren't that many programs for which a depth of 1000 is not enough but 10000 would be.
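For comparison, an xrange-style object needs only three stored integers no matter how many items it describes; every item is recomputed on demand. The class name below is my own illustration, not anything from the stdlib:

```python
class ConstantSizeRange:
    # A toy xrange-style sequence: stores exactly three integers,
    # regardless of how many items the range describes.
    def __init__(self, start, stop, step=1):
        self.start, self.stop, self.step = start, stop, step

    def __len__(self):
        # Number of items, computed from the three stored values.
        return max(0, (self.stop - self.start + self.step - 1) // self.step)

    def __getitem__(self, i):
        # Any item is derived on demand; nothing is materialized.
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.start + i * self.step
```

A billion-item instance costs the same memory as a ten-item one, which is the property a "compressed stack" would need but, with one stored n per frame, does not have.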


> So what I'm suggesting is an algorithm to compress that stack on the 
> fly. An algorithm that would detect regularities in the stack and instead of 
> saving each individual frame, save just the pattern. Then, there wouldn't be 
> any problem with showing informative stack trace: Despite not storing every 
> individual frame, each individual frame could still be accessed, similarly to 
> how `xrange` allows access to each individual member without having to store each 
> of them.

If you look at the way function calls are implemented in CPython, this would be 
very difficult.
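To illustrate the idea itself (this is emphatically not CPython's actual frame layout), one could run-length encode a stack of hashable frame descriptors, e.g. (function name, shape of locals). Storage then grows with the number of distinct runs rather than the depth, while individual frames stay addressable, xrange-style. All names here are hypothetical:

```python
def compress_stack(frames):
    # Run-length encode a list of hashable frame descriptors.
    # A deeply recursive stack collapses to a handful of (frame, count) runs.
    runs = []
    for frame in frames:
        if runs and runs[-1][0] == frame:
            runs[-1][1] += 1
        else:
            runs.append([frame, 1])
    return runs

def frame_at(runs, index):
    # Individual frames remain accessible without being stored,
    # which is what an informative traceback would need.
    for frame, count in runs:
        if index < count:
            return frame
        index -= count
    raise IndexError(index)
```

The hard part in a real interpreter is that frames are not identical records: each holds live local values, so "regularity" has to be detected modulo those values.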


However, if you look at Stackless CPython, or, especially, PyPy, that's 
another story. It's worth noting that storing the stack on the heap already 
gives you the ability to easily raise the recursion depth 50x, and nobody cares 
much about that, but people care a lot about the other benefits of the stackless 
implementation. So, your idea might be interesting even if your original 
motivation turns out not to be. For example, a 10x smaller stack might mean much 
better cache locality, or your compression might expose regularities that the 
PyPy JIT could exploit.

Also, you might want to look at how generator functions are implemented. Your 
original code can't be TCO'd because it has a non-empty continuation 
after the recursive call. But if you think about it, a set of local variables 
and a continuation point is basically what a generator state stores. If you take 
this far enough, a JIT might be able to optimize out not only tail recursive 
calls, but some non-tail-recursive implementations like yours.
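As a sketch of that connection (my own construction, assuming nothing beyond plain generators): a suspended generator stores its locals and its resume point on the heap, so a trampoline driving nested generators runs your original non-tail-recursive factorial without growing the C stack at all.

```python
import types

def factorial_gen(n):
    # The generator's saved state (n, plus where to resume) plays the
    # role of a stack frame; yielding a sub-generator replaces the call.
    if n <= 1:
        yield 1
    else:
        sub = yield factorial_gen(n - 1)  # "recursive call"
        yield n * sub                     # the non-empty continuation

def trampoline(gen):
    # Drive nested generators iteratively, keeping the "stack" on the heap.
    stack, value = [gen], None
    while stack:
        top = stack[-1]
        try:
            result = top.send(value)
            value = None
        except StopIteration:
            stack.pop()
            continue
        if isinstance(result, types.GeneratorType):
            stack.append(result)   # a yielded generator means "call this"
        else:
            stack.pop()            # a yielded value means "return this"
            value = result
    return value
```

With this, `trampoline(factorial_gen(5000))` succeeds even though the plain recursive version would blow the default recursion limit, precisely because each "frame" is a heap-allocated generator.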

