streams (was: Re: Itertools)

Beni Cherniavsky cben at techunix.technion.ac.il
Wed Jul 30 15:20:23 EDT 2003


Ronald Legere wrote on 2003-07-29:

> The new itertools stuff is pretty cool. One thing that bothers me
> though is that there seems to be no way to copy an iterator. How
> does one work around this? Is there a trick that I am missing?
>
> Without being able to 'split' an iterator, I can't think of how to
> do some of the cool things you can do with lazy lists in Haskell. It
> seems quite often you want to be able to do this 'splitting', and if
> I had more coffee I would post an example :) The ones I can think of
> are also defined recursively, which is another kettle of fish.
>
Strange, the need seems to come up more freqeuntly since I implemented
it ;-).  Or is it just me paying more attention to it?  After a couple
of design iterations, I've got:

http://www.technion.ac.il/~cben/python/streams.py

[NEWS: The last changes were addition of __nonzero__ methods, clued by
discussion on c.l.py <wink>, and a `.force(count=None)` method to
force computing values.]

I concluded that it's best to separate the two abstractions, providing
ability to convert both ways:

- An iterable immutable (but lazily computed) linked list, with O(1)
  tail access and ability to cons in front of it.

- An iterator object whose `.next()` method is necessarily
  destructive, on another hand.  Turns out the most flexible
  implementation of the iterator is simply a pointer to a linked list;
  this way you can split the iterator and even "unyield" values back
  onto it.  See the module for more details.

In general it's most clean to separate iterables from iterators.
Iterables should be changed by iterating.

Any feedback welcome.  I'd like to make it as Pythonic as possible.
An perhaps make it standard.  Particular open questions:

- Terminology.  There are too many names for the same things ;-).

  - The linked lists are called "streams" by me because they are lazy
    and that's what such beasts are called in functional languages.
    But I still have many refernces to them as linked lists,
    especially w.r.t. the `Cons` class which happens to be the base
    class of `Stream`.  So `Stream` is defined as a lazy linked list
    node.  Should I try to squash refences to "linked lists", treating
    them as "stream nodes which are already computed" to reduce
    confusion with Python lists?

    - car/cdr vs. first/rest vs. head/tail.  I chose head/tail.
      car/cdr would be opaque to people without lisp background.
      first/rest are of different length :-).

    - `Cons` still requires lisp background but I don't know any name
      that would be recognizable to non-lispers anyway.  And it's not
      a bad name.

  - Is `StreamIter` the best name?

    - I called the operation for creating new iterator from same place
      "forking".  Alternative names: `split`, `copy`, `clone`, `dup`,
      "lookahead", etc.  What's best?

    - Consing in front of a StreamIter's stream is called `.unyeild`.
      The point is that it's the opposite of the ``yield`` statement
      but the name is a brave new invention.  Alternatives: `prev`
      (antonym of `.next()`, probably misguided since it doesn't
      return the previous value but rather takes it from you),
      `pushback`, `unread`, `stuff`, etc.

- Functionality:

  - Should the operations (`.head`/`.tail`, indexing) on the
    underlying stream be proxied by `StreamIter` or should the
    programmer have to spell out e.g. ``iterator.stream.head`` as now?

  - What to do about __repr__ and __str__?  The streams could well be
    infinite and forcing lookahead for even a few values is
    unacceptable as this could be expensive.

  - What to do about __hash__ and __eq__?

  - Should I weakly memoize the `Stream` for each given iterable?
    That would reduce the risk of wrapping a destructive iterator
    twice with distinct streams, so that each produced item is seen by
    only one stream, with access-order-dependant interleaving,
    violating the functional intent of the laziness.  OTOH, this would
    make the programmer less aware of the risk.  Besides, I'm not sure
    what to return on the second attempt to wrap same iterator after
    some of it has already been consumed since the first time.

  - Would some convenience operations for lookahead, e.g.
    `.startswith()` be useful?

  - Anything else anybody needs?

-- 
Beni Cherniavsky <cben at tx.technion.ac.il>

Put a backslash at the evening to continue hacking onto the next day.





More information about the Python-list mailing list