Dangerous behavior of list(generator)

Thu Dec 31 20:47:17 EST 2009

On Thu, 31 Dec 2009 11:34:39 -0800, Tom Machinski wrote:

> On Wed, Dec 30, 2009 at 4:01 PM, Steven D'Aprano
> <steve at remove-this-cybersource.com.au> wrote:
>> On Wed, 30 Dec 2009 15:18:11 -0800, Tom Machinski wrote:
>>> Bottom line, I'm going to have to remove this pattern from my code:
>>>
>>>   foo = (foo for foo in foos if foo.bar).next()
>>
>> I don't see why. What's wrong with it? Unless you embed it in a call to
>> list, or similar, it will explicitly raise StopIteration as expected.
> 
> Exactly; this seems innocuous, but if some caller of this code uses it
> in a list() constructor, a very subtle and dangerous bug is introduced -
> see OP. This is the entire point of this post.

Then don't use it in a list() constructor. 

That's a glib answer, of course. A better answer is to point out that the 
problem is not with the above expression, but with letting StopIteration 
bubble up as an error exception instead of dealing with it immediately. 
That's not what it's for, and you can't trust it not to be captured by 
something. If StopIteration represents an error condition, you need to 
deal with it immediately and convert it to an exception which isn't 
likely to disappear.

> In a large, non-trivial application, you simply cannot afford the
> assumption that no caller will ever do that. Even if you have perfect
> memory, some of your other developers or library users may not.

You shouldn't put the responsibility of dealing with the StopIteration on 
the caller, because StopIteraction is a signal not an error condition, 
and you can't tell when that signal will disappear. The responsibility 
lies on the writer of the function containing the line (that is, the 
Original Poster of this thread).

So you need something like this:

def my_function():
    try:
        foo = (foo for foo in foos if foo.bar).next()
    except StopIteration:
        handle_empty_foos()
    else:
        handle_at_least_one_foo()

handle_empty_foos may be as simple as raising a new exception and letting 
that bubble up to whatever layer of the application is expected to deal 
with it.

> As for what's wrong with the "if not any" solution, Benjamin Kaplan's
> post hits the nail on its head. This is a bioinformatics application, so
> the iterable "foos" tends to be very large, so saving half the runtime
> makes a big difference.

Possibly you haven't seen my reply to Benjamin, so I'll paraphrase: 
that's incorrect, because any() is lazy and will return as soon as it 
hits a non-false item. See the docs:

http://docs.python.org/library/functions.html#any

If the foo items are considered true (e.g. non-empty strings), then you 
can guarantee that any() will return on the very first item.

If the foo items are arbitrary objects which have an equal chance of 
being considered true or false, then on average it will have to look at 
half the list, which is O(N) and may be a tad expensive for large N. But 
how likely is that? One has to be realistic here, and consider the type 
of data you realistically need to deal with and not pathological cases. 
There's no limit to the problems you may have with sufficiently 
pathological data:

class Evil(object):
    @property
    def bar(self):
        import time
        time.sleep(1e8)
        return True

foos = [Evil(), "a", "b", "c", "d"]
foo = (foo for foo in foos if foo.bar).next()

any() is the standard, idiomatic solution for solving this sort of 
problem. Before rejecting it on the basis of slowness, you need to 
determine that long runs of false items ahead of the first true item is a 
realistic scenario, and that calling any() really is a bottleneck. 
Anything less is premature optimization.

-- 
Steven