[Python-Dev] Re: Reiterability

Alex Martelli aleaxit at yahoo.com
Sun Oct 19 06:05:56 EDT 2003


On Sunday 19 October 2003 00:05, Guido van Rossum wrote:
   ...
> > class ReiterableIterator(object):
> >     def __init__(self, thecallable, *itsargs, **itskwds):
   ...
> Why put support for a callable with arbitrary arguments in the
> ReiterableIterator class?  Why not say it's called without args, and
> if the user has a need to use something with args, they can use one of
> the many approaches to currying?

The typical and most frequent case would be that generating a
new iterator requires calling iter(asequence) -- i.e., the typical case
does require arguments.  So, just like e.g. for threading.Thread, 
atexit.register, and other callables that take a callable argument, it 
makes more sense to NOT require the user to invent a currying 
approach (note btw that iter does NOT support the iter.__get__ trick,
of course, as it's a builtin function and not a Python function).  It
would be different if Python supported a curry built-in, but it doesn't.
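
For concreteness, here's a minimal sketch of such a ReiterableIterator
(reconstructed along the lines of the snippet quoted above -- the exact
body was snipped, so take the details as illustrative only):

class ReiterableIterator(object):
    def __init__(self, thecallable, *itsargs, **itskwds):
        # stash the factory callable and its arguments, so each
        # __iter__ call can manufacture a fresh iterator
        self.thecallable = thecallable
        self.itsargs = itsargs
        self.itskwds = itskwds
    def __iter__(self):
        return self.thecallable(*self.itsargs, **self.itskwds)

# typical use: the factory DOES need arguments, e.g. iter(asequence)
r = ReiterableIterator(iter, range(3))
assert list(r) == [0, 1, 2]
assert list(r) == [0, 1, 2]     # a second full pass works, too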


> > typical toy example use:
   ...
> Are there any non-toy examples?

I have not met any, yet -- whence my interest in hearing about use cases
from anybody who might have.

> I'm asking because I can't remember ever having had this need myself.

Right, me neither.


> A better name would be clone(); copy() would work too, as long as it's
> clear that it copies the iterator, not the underlying sequence or
> series.  (Subtle difference!)
>
> Reiteration is a special case of cloning: simply stash away a clone
> before you begin.

Good name, and good point.
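
E.g., with a toy iterator that does supply such a clone method (purely
illustrative -- no real iterator today exposes this):

class cloneiter(object):
    # toy sequence-walking iterator supporting the proposed clone()
    def __init__(self, seq, pos=0):
        self.seq, self.pos = seq, pos
    def __iter__(self):
        return self
    def next(self):
        if self.pos >= len(self.seq):
            raise StopIteration
        self.pos += 1
        return self.seq[self.pos - 1]
    def clone(self):
        # an independent iterator at the same position: cheap, since
        # only the (shared, unmutated) sequence and an index are needed
        return cloneiter(self.seq, self.pos)

it = cloneiter(['a', 'b', 'c'])
snap = it.clone()                     # stash a clone before you begin
assert list(it) == ['a', 'b', 'c']
assert list(snap) == ['a', 'b', 'c']  # ...and reiterate via the clone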

> > Roughly the usual "protocol" approach -- functions use an object's
> > ability IF that object exposes methods providing that ability, and
> > otherwise fake it on their own.
>
> In this case I'm not sure if it is desirable to do this automatically.

Ah, yes, such an automatic fallback might be a performance trap -- good point.

> If I request a clone of an iterator for a data stream coming from a
> pipe or socket, it would have to start buffering everything.  Sure, I
> can come up with a buffering class that throws away buffered data that
> none of the existing clones can reach, but I very much doubt if it's
> worth it; a customized buffering scheme for the application at hand
> would likely be more efficient than a generic solution.

Then clone(it) should raise an exception if it does NOT expose a
method supplying "easy cloning" (or, more simply, it.clone() could
do it by raising an AttributeError :-), alerting the user of the need
to use such a "buffering class" wrapper:
    try: clo = it.clone()
    except AttributeError: clo = BufferingWrapper(it)

But if no existing iterator supplies .clone() -- even when it would
be very easy for it to do so -- this fallback would end up buffer-wrapping
everything.
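
(Just to fix ideas, a deliberately naive BufferingWrapper -- all clones
share one buffer that is never trimmed, so it exhibits exactly the
memory cost you describe; the name and details are only a sketch:)

class BufferingWrapper(object):
    def __init__(self, it, buf=None, pos=0):
        # all clones share the same underlying iterator and buffer;
        # each clone keeps only its own position into that buffer
        if buf is None:
            buf = []
        self.it, self.buf, self.pos = it, buf, pos
    def __iter__(self):
        return self
    def next(self):
        if self.pos == len(self.buf):
            # this clone is the furthest ahead: pull a fresh item
            self.buf.append(self.it.next())
        item = self.buf[self.pos]
        self.pos += 1
        return item
    def clone(self):
        return BufferingWrapper(self.it, self.buf, self.pos)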


> > > I'm not sure what you are suggesting here.  Are you proposing that
> > > *some* iterators (those which can be snapshotted cheaply) sprout a
> > > new snapshot() method?
> >
> > If snapshottability (eek!) is important enough, yes, though
> > __snapshot__ might perhaps be more traditional (but for iterators we do
> > have the precedent of method next without __underscores__).
>
> (Which I've admitted before was a mistake.)

Ah, I didn't recall that admission, sorry.  OK, underscores everywhere then.


> A problem I have with making iterator cloning a standard option is
> that this would pretty much require that all iterators for which
> cloning can be implemented should implement clone().  That in turn
> means that iterator implementors have to work harder (sometimes
> cloning can be done cheaply, but it might require a different
> refactoring of the iterator implementation).

Making iterator authors aware of their clients' possible need to clone
doesn't sound bad to me.  There's no _compulsion_ to provide the
functionality, but a little "social pressure" to provide it when a
refactoring can afford it -- well, why not?

> Another issue is that it would make generators second-class citizens,
> since they cannot be cloned.  (It would seem to be possible to copy a
> stack frame, but then the question arises whether to use shallow or deep
> copying -- if a local variable in a generator references a list,
> should the list be copied or not?  And if it should be copied, should
> it be a deep or shallow copy?  There's no good answer without knowing
> the intention of the programmer.)

Hmmm, there's worse -- if a generator uses an iterator, the latter should
be cloned, not copied, to produce the generator-clone effect, e.g.:

def by2(it):
    for x in it: yield x*2

If it is a list, I don't think this is a problem -- even today the user
cannot change it for the lifetime of iterators produced by by2(it)
without weird effects, e.g. "for x in by2(L): L.append(x)" gives an
infinite loop.
But if it is an iterator, it should be cloned at the time an iterator
produced by by2(it) is cloned.  Eeep.  No, you're right: in the general
case I cannot see how to clone generator-produced iterators.
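
(Indeed, merely sharing the underlying iterator is no clone at all:

src = iter(range(5))
g1 = by2(src)
g2 = by2(src)       # NOT a clone of g1 -- both consume the same src
print g1.next()     # 0
print g2.next()     # 2: g2 "steals" items rather than replaying them

so even a hypothetical frame-copy that shared the local 'it' would be
just as wrong.)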


> > It seems to me that the ability to back up and that of snapshotting
> > are somewhat independent.
>
> Backing up suggests a strictly limited buffer; cloning suggests an
> unlimited one.

Unless you need to provide "unlimited undo", yes, but that's a harder
problem anyway (needing different architecture).

> > may be just because it's the one case for which I happened to
> > stumble on some use cases in production (apart from "undoing", which
> > isn't too bad to handle in other ways anyway).
>
> I'd like to hear more about those cases, to see if they really need
> cloning (:-) or can live with a fixed limited backup capability.

I have an iterator it whose items, after an arbitrary prefix terminated by
the first empty item, are each supposed to be 'yes' or 'no'.

I need to process it with different functions depending on what proportion
of 'yes'/'no' it contains (and with yet another function if it has any
invalid items) -- each of those functions needs to get the iterator from
right after that 'first empty item'.

Today, I do:

def dispatchyesno(it, any_invalid, selective_processing):
    # skip the prefix
    for x in it:
        if not x: break
    # snapshot the rest
    snap = list(it)
    it = iter(snap)
    # count and check
    yeses = noes = 0
    for x in it:
        if x == 'yes': yeses += 1
        elif x == 'no': noes += 1
        else: return any_invalid(snap)
    total = float(yeses + noes)
    if not total: raise ValueError, "sequence empty after prefix"
    ratio = yeses / total
    for threshold, function in selective_processing:
        if ratio <= threshold: return function(snap)
    raise ValueError, "no function to deal with a ratio of %s" % ratio

(yes, I could use bisect, but the number of items in selective_processing
is generally quite low so I didn't bother).
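
Purely hypothetical usage, with throwaway handler functions, just to
show the shape of selective_processing (ascending (threshold, function)
pairs):

def mostly_no(items): return 'mostly no'
def mostly_yes(items): return 'mostly yes'
def bad(items): return 'invalid items present'

data = ['some header', '', 'yes', 'no', 'no']
print dispatchyesno(iter(data), bad,
                    [(0.5, mostly_no), (1.0, mostly_yes)])
# ratio is 1/3 <= 0.5, so this prints 'mostly no'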

Basically, I punt and "snapshot" by making a list out of what is left of
my iterator after the prefix.  That may be the best I can do in some cases,
but in others it's a waste.  (Oh well, at least infinite iterators are not a
consideration here, since I do need to exhaust the iterator to get the
ratio:-).  What I plan to do if this becomes a serious problem in the
future is add something like an optional 'clone=None' argument so I
can code:
    if clone is None:
        snap = list(it)
        it = iter(snap)
    else: snap = clone(it)
instead of what I have hardwired now.  But, I _would_ like to just do, e.g.:
    try: snap = it.clone()
    except AttributeError:
        snap = list(it)
        it = iter(snap)
using some standardized protocol for "easily clonable iterators" rather
than requiring such awareness of the issue on the caller's part.


> I think a standard backup wrapper would be a useful thing to have
> (maybe in itertools?); since generator functions can't be cloned, I'm
> going to push back on the need for cloning for now until I see a lot
> more non-toy evidence.

Very reasonable, sure.  I suspect the discussion of a backup wrapper
is best moved to another thread, given that this msg is already so long
and there are all the usual finicky details to nail down....
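
Just to give the flavor of what I have in mind (strictly bounded
backup; the class and method names are entirely tentative):

class BackupWrapper(object):
    def __init__(self, it, n=1):
        # remember at most the n most recently delivered items
        self.it = it
        self.n = n
        self.memory = []        # recently delivered items
        self.pending = []       # backed-up items, to re-deliver
    def __iter__(self):
        return self
    def next(self):
        if self.pending:
            item = self.pending.pop()
        else:
            item = self.it.next()
        self.memory.append(item)
        if len(self.memory) > self.n:
            del self.memory[0]
        return item
    def backup(self, k=1):
        # back up k steps, k at most the number of remembered items
        for i in range(k):
            if not self.memory:
                raise ValueError, "cannot back up any further"
            self.pending.append(self.memory.pop())

w = BackupWrapper(iter('abcde'), n=2)
assert w.next() == 'a' and w.next() == 'b'
w.backup(2)
assert w.next() == 'a'      # 'a' and 'b' get delivered again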


Alex