[Web-SIG] thoughts on an iterator

Brandon Craig Rhodes brandon at rhodesmill.org
Sat Mar 28 20:24:33 CET 2009


Graham, I confess that it was I who brought up the idea of a wsgi.input
iterator at the WSGI Open Space yesterday evening. :-) The discussion
seemed to be assuming a file-like input object that could be read from
by a piece of middleware, then "backed up" or "rewound" before passing
it down to the next layer.  This seemed to have problems: it doesn't
support the case where the middleware wants to alter the input or pass
it piecemeal down to the client as it comes in, and it also means that
the *entire* input stream has to be kept around in memory for the
lifetime of the whole request, in case the layer reading it is not the
"real client" at the bottom of the stack and some later consumer asks
for the whole thing to be replayed.

So, I suggested placing the responsibility for rewind and buffering on
the middleware.  You want to read 2k of the input to make a middleware
decision before invoking the next layer down?  Then read it, and pass
along a fresh iterator that first yields that 2k, then starts yielding
everything from the partially-read iterator.  Or, you can pass along a
filter iterator that scans or changes the entire stream as it reads it
from the upstream iterator.
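
To make the first case concrete, here is a rough sketch, assuming the
proposed design where environ['wsgi.input'] is an iterator of
byte-string chunks (peeking_middleware and downstream_app are invented
names for illustration):

   import itertools

   def peeking_middleware(environ, start_response):
       source = iter(environ['wsgi.input'])
       head, size = [], 0
       for chunk in source:             # buffer roughly the first 2k
           head.append(chunk)
           size += len(chunk)
           if size >= 2048:
               break
       # ... inspect b''.join(head) to make the middleware decision ...
       # Pass along a fresh iterator: first the buffered chunks, then
       # everything left in the partially-read source.
       environ['wsgi.input'] = itertools.chain(head, source)
       return downstream_app(environ, start_response)

The filter case is even simpler: a generator that pulls chunks from the
upstream iterator, transforms them, and yields them on.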

But, having thought more about the idea, I think that your criticisms,
Graham, are exactly on target.  Iterators don't give the reader enough
control over the chunks (lines or blocks) that get delivered as it
reads.  So at the very least we should indeed be looking at a
file-like object; it's still easy to construct a file-like object that's
really streaming from another file as it comes in, and we could even
provide shortcuts to build files from inline iterators or something.
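
Such a shortcut might look something like this sketch (IterFile is an
invented name, and it supports only read(), not readline()):

   class IterFile(object):
       """File-like object that streams from an iterator of strings."""
       def __init__(self, iterable):
           self._source = iter(iterable)
           self._buffer = b''
       def read(self, size=-1):
           # Pull chunks until we can satisfy the request (or hit EOF).
           while size < 0 or len(self._buffer) < size:
               try:
                   self._buffer += next(self._source)
               except StopIteration:
                   break
           if size < 0:
               data, self._buffer = self._buffer, b''
           else:
               data, self._buffer = (self._buffer[:size],
                                     self._buffer[size:])
           return data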

And, the idea that each piece of middleware does its *own* buffering
might be a bad one too.  One might naively store everything in RAM,
another might put blocks on disk, another might run you out of /tmp
space trying to do the same thing - even storing duplicates of the same
data if we're not careful!  The same 1MB initial block could wind up on
disk two or three times if each piece of middleware thinks it's the one
with it cached to pass along to the bottom layer that's reading 16k
blocks at a time.

So what's left of my suggestion?  I suggest that we *not* commit to
unlimited rewinding of the input stream; that was my single real
insight, and an uncontrollable iterator design gives up too much in
order to achieve that.  A file-like object is more appropriate, but we
either need to make middleware do its own caching of partially-consumed
data, *or* we need some way for middleware to signal whether it needs
the older data kept.

Could "input.bookmark()" signal that everything from this point on in
the stream needs to be retained, in memory or on disk, to be rewound to
later?  And the data be released only after the bookmark is deleted?

   b = input.bookmark()   # retain everything read from this point on
   input.read()...        # reads are buffered on behalf of the bookmark

   input2 = b.file()      # a file-like object rewound to the bookmark
   del b                  # bookmark gone; the buffered data is released
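
Here is a minimal sketch of how such a wrapper might work (all names
are invented; note that b.file() here only replays the data already
retained, while a full design would splice the replayed data back into
the live stream):

   import io

   class BookmarkedInput(object):
       def __init__(self, fileobj):
           self._file = fileobj
           self._retained = []      # chunks kept for live bookmarks
           self._live = 0           # number of outstanding bookmarks
       def read(self, size=-1):
           data = self._file.read(size)
           if self._live and data:
               self._retained.append(data)
           return data
       def bookmark(self):
           self._live += 1
           return _Bookmark(self, len(self._retained))

   class _Bookmark(object):
       def __init__(self, owner, start):
           self._owner, self._start = owner, start
       def file(self):
           # Replay everything read since the bookmark was taken.
           return io.BytesIO(b''.join(self._owner._retained[self._start:]))
       def __del__(self):
           self._owner._live -= 1
           if self._owner._live == 0:
               del self._owner._retained[:]  # release the buffered data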

Or, we could allow the "input" object to support cloning, where all data
is cached from the clone-that's-read-least-far to the one that's read
the farthest:

   c = input.clone()
   input.read(100)
   # 100 bytes are now cached by the framework, in RAM or on disk or on
   # a USB keyfob or wherever this framework puts it. (Django will write
   # their own caching that's different from everyone else's).
   c.read(100)
   # Now every reader has consumed those bytes, so they are released.
   del c
   # With just one active reader left, no buffering takes place at all.
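
And here is a rough sketch of shared-buffer cloning (again, all names
are invented, an explicit close() stands in for the "del c" above, and
a real version would need readline(), error handling, and so on):

   class CloneableInput(object):
       def __init__(self, fileobj, _shared=None):
           self._shared = _shared or _Shared(fileobj)
           self._pos = 0
           self._shared.readers.append(self)
       def clone(self):
           twin = CloneableInput(None, self._shared)
           twin._pos = self._pos
           return twin
       def read(self, size):
           data = self._shared.read_at(self._pos, size)
           self._pos += len(data)
           self._shared.trim()      # drop bytes every reader has passed
           return data
       def close(self):
           self._shared.readers.remove(self)
           self._shared.trim()

   class _Shared(object):
       def __init__(self, fileobj):
           self.file = fileobj
           self.buffer = b''        # span between slowest/fastest reader
           self.base = 0            # absolute offset of buffer[0]
           self.readers = []
       def read_at(self, pos, size):
           need = pos + size - (self.base + len(self.buffer))
           if need > 0:
               self.buffer += self.file.read(need)
           return self.buffer[pos - self.base : pos - self.base + size]
       def trim(self):
           if not self.readers:
               self.buffer = b''
               return
           slowest = min(r._pos for r in self.readers)
           self.buffer = self.buffer[slowest - self.base:]
           self.base = slowest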

That way middleware could "read ahead" on its own input while passing
the complete stream down to the next level.  This has the disadvantage
that if a middleware piece wants to keep the first 100MB and last 100MB
from a stream but throw out the middle, it's got no way to do so without
dropping back to its own caching scheme that the framework can't
coordinate with other schemes; but it seems to cover the majority of
cases that I can think of.

Anyway: no unlimited caching, no unlimited rewind; that's my argument.
Iterators were just one way of cleanly getting there, but probably, in
the light of the next day, not a powerful enough way.

-- 
Brandon Craig Rhodes   brandon at rhodesmill.org   http://rhodesmill.org/brandon

