Reducing "yield from" overhead in recursive generators

Rathmann rathmann.os at gmail.com
Thu Mar 17 23:14:49 EDT 2022


Summary: I am looking for feedback on a potential technique for coding
recursive generators.

Oftentimes the most convenient algorithms for creating a sequence of
results are recursive, while the most convenient interface for the
consumption of those results is an iterator.  PEP 380 introduced the
"yield from" syntax which provides a technique for generators to
delegate to other generators, including recursively.

The trouble is that recursion and generators don't necessarily play
all that well together from a performance point of view.  As Greg
Ewing noted in PEP 380:

	The overhead of passing __next__() calls and yielded values
	down and up the chain can cause what ought to be an O(n)
	operation to become, in the worst case, O(n**2).

Ewing goes on to describe some optimizations possible in Python
implementations.  However, in the Python versions I have checked,
there is still O(depth) overhead for a chain of "yield from" calls.

To make this concrete, let's consider a problem with a canonical
recursive implementation -- an inorder tree traversal.  (My actual
motivation for thinking about this issue comes from implementing
generators for various kinds of combinatorial objects.  But tree
traversal provides a good familiar example.)

For discussion purposes, let's represent a tree node as a tuple of the
form (key, left, right).  Then, a simple tree might be

```python
tree1 = ('b', ('a', None, None), ('c', None, None))
```

Or, a badly unbalanced example could be:

```python
tree2 = ('a', None, 
         ('b', None,
          ('c', None,
           ('d', None,
            ('e', None,
             ('f', None,
              ('g', None,
               ('h', None,
                ('i', None, None)))))))))
```
		
Then, we can code inorder traversal as:

```python
def inorder_traverse(node):
    k, l, r = node
    if l:
        yield from inorder_traverse(l)
    yield k
    if r:
        yield from inorder_traverse(r)
```
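As a quick sanity check (repeating the definitions above so the
snippet runs on its own), the traversal yields the keys in sorted
order:

```python
def inorder_traverse(node):
    k, l, r = node
    if l:
        yield from inorder_traverse(l)
    yield k
    if r:
        yield from inorder_traverse(r)

tree1 = ('b', ('a', None, None), ('c', None, None))
result = list(inorder_traverse(tree1))
print(result)  # ['a', 'b', 'c']
```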

So far, all very familiar.  The trouble is that if you are at a depth
of d, every value that is yielded by the algorithm gets passed through
d stack frames before it reaches the actual consumer of the data.
This issue, in the specific case of tree traversal, was the subject
of a 2016 comp.lang.python thread, "Linear Time Tree Traversal
Generator".  I am not reviving that five-year-old thread, since my
objectives are a little different.

In that thread, Ned Batchelder suggested (and gave a nice
implementation for) reimplementing the traversal to use an explicit
stack and a loop to traverse the tree, so that "each yield directly
produces a value to the caller."  That, to my mind, basically solved
the original poster's problem -- the code for the iterative version is
reasonably concise and understandable, and is faster than the
recursive version both in the O(n) sense and also for small problems.
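For reference, here is one sketch of that stack-and-loop approach for
the tuple representation used above (my own rendering of the idea, not
Batchelder's code):

```python
def inorder_iterative(node):
    # An explicit stack replaces the chain of generator frames, so
    # each yield hands a value directly to the consumer.
    stack = []
    while stack or node:
        while node:            # walk down the left spine
            stack.append(node)
            node = node[1]
        node = stack.pop()
        yield node[0]          # the key
        node = node[2]         # then visit the right subtree

tree1 = ('b', ('a', None, None), ('c', None, None))
result = list(inorder_iterative(tree1))
print(result)  # ['a', 'b', 'c']
```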

But, when the candidate recursive algorithm is more complicated than
that for tree traversal, translating to an iterative version can be
difficult.  Maintaining a stack of local variables is not too hard,
but converting a recursive function to iterative completely changes
the flow, such that familiar control structures become a tangled mass
of GOTOs.  (Which is even harder in a language without a GOTO.)  The
result (at least when I have tried) can be hard to write and (worse)
hard to read.

I was looking at the problem and realized that there might be a middle
ground -- avoiding the O(depth) cost of a chain of "yield from"
clauses, but still keeping the recursive code looking reasonably close
to the original.

The idea is to modify the recursive function by replacing the "yield
from" forms with "yield".  The transformed function needs to run from
a driver, which simulates the normal call stack.  The driver checks
each result yielded from the function -- if it is a regular value, it
is returned (yielded) to the caller immediately, but if the result is
a generator, it is pushed onto that stack.  The Python code for such a
driver is easier to understand than any English description I have
been able to create, so here it is:

```python
import types

def recursive_generator_driver(g):
    """g is a generator object, which is likely to call other generators"""
    stack = [g]
    while True:
        g = stack[-1]
        try:
            val = next(g)
            if isinstance(val, types.GeneratorType):
                # Yielded value represents a recursive call, so push it
                stack.append(val)
            else:
                # Regular value for the iterator; give it to the caller
                yield val
        except StopIteration:
            stack.pop()
            if len(stack) == 0:
                return
```

and the modified tree traverser is just

```python
def inorder_traverse_mod(node):
    """ONLY for use with recursive_generator_driver"""
    k, l, r = node
    if l:
        yield inorder_traverse_mod(l) # NOT yield from 
    yield k
    if r:
        yield inorder_traverse_mod(r) # NOT yield from
```
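Putting the two pieces together (repeating the definitions and the
unbalanced tree from above so the snippet is self-contained), the
consumer simply wraps the top-level call in the driver:

```python
import types

def recursive_generator_driver(g):
    """g is a generator object, which is likely to call other generators"""
    stack = [g]
    while True:
        g = stack[-1]
        try:
            val = next(g)
            if isinstance(val, types.GeneratorType):
                stack.append(val)   # recursive call: push
            else:
                yield val           # regular value: pass to caller
        except StopIteration:
            stack.pop()
            if len(stack) == 0:
                return

def inorder_traverse_mod(node):
    k, l, r = node
    if l:
        yield inorder_traverse_mod(l)  # NOT yield from
    yield k
    if r:
        yield inorder_traverse_mod(r)  # NOT yield from

tree2 = ('a', None,
         ('b', None,
          ('c', None,
           ('d', None,
            ('e', None,
             ('f', None,
              ('g', None,
               ('h', None,
                ('i', None, None)))))))))

result = list(recursive_generator_driver(inorder_traverse_mod(tree2)))
print(result)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
```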

This seems to work correctly (at least in my testing).  But does
saving the "yield from" cost overcome the cost of the driver function
with its explicit stack and isinstance() calls?

The tradeoffs depend on the specific Python implementation and the
algorithm that is to be modified.  For the tree traversal example,
with PyPy (Ubuntu, 7.3.1) the driver version becomes faster at a
complete binary tree of height 12 (8191 nodes), while with CPython
3.8.10 it is not until a height of 22 that the driver version
dominates.

A perfectly balanced tree is the best case for the "yield from" code.
With a sufficiently badly balanced tree, the driver version can be
many times faster.

So what do people think of this hack/technique?  Is it

A) A terrible idea?  (Please be specific.)
B) Already well-known (and I just missed it)?
C) Potentially useful for the niche case of deeply recursive
   generators?

Some additional notes (and thanks if you are still reading this long posting):

1) The same driver can be used with different recursive functions, so
   this is easy to try out.

2) If a function can be converted to be purely iterative, (as with
   Batchelder's tree traversal) it will often be faster than either
   the driver or yield from versions.  Hence the driver version would
   be used for those functions where conversion to an iterative
   version would be more difficult.

3) If you don't like the isinstance() calls, the type can be signaled
   some other way.  For example the yield statements can be written to
   pass back a pair, the first element of which is a flag indicating
   whether the second element is a result value or a generator.
   
4) This approach means that recursion depth is not restricted by the
   size of the Python stack.  That might be a benefit, but if not, the
   driver could be modified to enforce a maximum recursion depth.
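To illustrate note 3, here is one hypothetical way to tag the yields
(the tag names and function names are my own, not part of the original
technique):

```python
CALL, VALUE = 'call', 'value'

def driver_tagged(g):
    # Variant of the driver from note 3: each yield is a (tag, payload)
    # pair, so no isinstance() check is needed.
    stack = [g]
    while stack:
        try:
            tag, payload = next(stack[-1])
        except StopIteration:
            stack.pop()
            continue
        if tag == CALL:
            stack.append(payload)  # payload is a generator: push it
        else:
            yield payload          # payload is a regular value

def inorder_tagged(node):
    k, l, r = node
    if l:
        yield (CALL, inorder_tagged(l))
    yield (VALUE, k)
    if r:
        yield (CALL, inorder_tagged(r))

tree1 = ('b', ('a', None, None), ('c', None, None))
result = list(driver_tagged(inorder_tagged(tree1)))
print(result)  # ['a', 'b', 'c']
```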

