set and dict iteration

Mon Sep 3 21:26:35 EDT 2012

On Mon, 03 Sep 2012 13:04:23 -0700, Aaron Brady wrote:

> On Monday, September 3, 2012 2:30:24 PM UTC-5, Ian wrote:
>> On Sun, Sep 2, 2012 at 11:43 AM, Aaron Brady <castironpi at gmail.com>
>> wrote:
>> 
>> > We could use a Python long object for the version index to prevent
>> > overflow.  Combined with P. Rubin's idea to count the number of open
>> > iterators, most use cases still wouldn't exceed a single word
>> > comparison; we could reset the counter when there weren't any.
>> 
>> We could use a Python long; I just don't think the extra overhead is
>> justified in a data structure that is already highly optimized for
>> speed.  Incrementing and testing a C int is *much* faster than doing
>> the same with a Python long.
> 
> I think the technique would require two python longs and a bool in the
> set, and a python long in the iterator.
> 
> One long counts the number of existing (open) iterators.  Another counts
> the version.  The bool keeps track of whether an iterator has been
> created since the last modification, in which case the next modification
> requires incrementing the version and resetting the flag.

I think that is over-engineered and could be the difference between 
having the patch accepted and having it rejected. After all, many people 
will argue that the existing solution to the problem is good enough.

Dicts are extremely common, and your patch increases both the memory 
usage of every dict, and the overhead of every write operation 
(__setitem__, __delitem__, update). Only a very few dicts will actually 
need this overhead, for the rest it is waste. It is important to keep 
that waste to a minimum or risk having the patch rejected.

An unsigned C int can count up to 4,294,967,295. I propose that you say 
that is enough iterators for anyone, and use a single, simple, version 
counter in the dict and the iterator. If somebody exceeds that many 
iterators to a single dict or set, and the version field overflows by 
exactly 2**32 versions, the results are no worse than they are now. You 
won't be introducing any new bugs.

Complicating the solution is, in my opinion, unnecessary. Why should 
every set and dict carry the cost of incrementing TWO Python longs and a 
flag when just a single C int is sufficient for all realistic use-cases?

The opportunity for failure is extremely narrow:

- I must have an iterator over a dict or set;
- and I must have paused the iteration in some way;
- and then I must create exactly 2**32 other iterators to the 
  same dict or set;
- and at some point modify the dict or set
- and then restart the first iterator

at which point some items returned by the iterator *may* be duplicated or 
skipped (depends on the nature of the modifications).

-- 
Steven