[Python-ideas] Caching iterators

Ryan Gonzalez rymg19 at gmail.com
Wed Feb 26 00:08:44 CET 2014


Note:

I PROMISE this is a better idea than my last 20 bad ones (raise_if, shlex
extra argument, etc.).

I'll illustrate the idea with a completely non-realistic and contrived
example. Say you have a lexer that's really slow, written as a generator
function:

def mylexer(input):
    while input:
        ...
        if xyz:
            yield SomeToken()

Now, a parser that uses that lexer always has to wait for the lexer to
yield the next token *after* it has already parsed the current token. That
can be somewhat time-consuming.

Caching iterators are based on this idea: what if the iterator ran at the
same time as the function consuming its elements? Or, better yet, what if
an iterator wrapper took an iterator and kept pulling its elements while
the function using them is running? This is easily accomplished using
multiprocessing and pipes.

Since that was somewhat vague, here's an example:

import time

def my_iterator():
    for i in range(5):
        time.sleep(0.2)  # simulate a slow producer
        yield i

for item in my_iterator():
    time.sleep(0.5)  # simulate a slow consumer
    print(item)

Now with a normal iterator, the flow is like this:

   - Wait for my_iterator to return an element (0.2s)
   - Wait 0.5s and print the element (0.5s)

In total, that takes 0.7s per element. What a waste! What if the iterator
were yielding elements at the same time as the for loop was using them?
During each 0.5s loop body, the iterator could generate ~2.5 elements.
That's what a caching iterator does: it runs both at the same time using
multiprocessing. It's safe as long as the iterator doesn't depend on
whatever is consuming it. An example:

import itertools
import time

def my_iterator():
    for i in range(5):
        time.sleep(0.2)
        yield i

for item in itertools.CachingIterator(my_iterator()):  # this is the only change
    time.sleep(0.5)
    print(item)

Now the flow is like this:

   - Wait for my_iterator to return the very first element.
   - While the loop body processes that first element, keep receiving
   elements from my_iterator(), storing them in an intermediate buffer
   (similar to a deque).
   - When the loop body finishes, take the next element from the
   intermediate buffer.
   - While that element is processed, continue receiving elements...

...and so forth. That way, no time is wasted waiting for the loop body to
finish.
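
As a back-of-the-envelope check, using the numbers from the example: run
serially, each element costs 0.2s + 0.5s = 0.7s, so five elements take
3.5s. With the producer overlapped, the total is roughly 0.2s (waiting for
the first element) + 5 * 0.5s (the loop bodies) = 2.7s.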

I have a working implementation. Although there is a slight overhead, in
the above example about 0.4s is still saved.
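
Here's a rough sketch of the idea using a background thread and a queue
instead of multiprocessing and pipes, just because the threaded version is
shorter to show; it's illustrative, not the actual implementation:

import queue
import threading

class CachingIterator:
    """Wrap an iterator and consume it eagerly in a background worker."""

    _SENTINEL = object()  # marks exhaustion of the underlying iterator

    def __init__(self, iterator):
        self._items = queue.Queue()
        worker = threading.Thread(target=self._fill, args=(iterator,))
        worker.daemon = True  # don't let the worker keep the process alive
        worker.start()

    def _fill(self, iterator):
        # Background worker: pull elements as fast as the underlying
        # iterator produces them and buffer them for the consumer.
        for item in iterator:
            self._items.put(item)
        self._items.put(self._SENTINEL)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._items.get()  # blocks until the producer catches up
        if item is self._SENTINEL:
            raise StopIteration
        return item

A real version would also need to forward exceptions raised by the
producer and bound the buffer so a fast producer can't use unbounded
memory.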

There could also be an lmap function, which just does this:

def lmap(f, it):
    yield from map(f, CachingIterator(it))
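
For example, reusing my_iterator from above (hypothetical usage):

def slow_double(x):
    time.sleep(0.5)  # simulate an expensive f
    return 2 * x

# f runs on each element while my_iterator keeps producing in the background
for result in lmap(slow_double, my_iterator()):
    print(result)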

Thoughts?


-- 
Ryan
If anybody ever asks me why I prefer C++ to C, my answer will be simple:
"It's becauseslejfp23(@#Q*(E*EIdc-SEGFAULT. Wait, I don't think that was
nul-terminated."