[Web-SIG] thoughts on an iterator

Robert Brewer fumanchu at aminus.org
Sat Mar 28 22:42:40 CET 2009


Brandon Craig Rhodes wrote:
> Graham, I confess that it was I who brought up the idea of a wsgi.input
> iterator at the WSGI Open Space yesterday evening. :-) The discussion
> seemed to be assuming a file-like input object that could be read from
> by a piece of middleware, then "backed up" or "rewound" before passing
> it down to the next layer.  This seemed to have problems: it doesn't
> support the case where the middleware wants to alter the input or pass
> it piecemeal down to the client as it comes in, and it also means that
> the *entire* input stream has to be kept around in memory for the
> lifetime of the whole request in case the client reading it is not the
> "real client" at the bottom of the stack, and a request is coming that
> will ask for the whole thing to be replayed.
> 
> So, I suggested placing the responsibility for rewind and buffering on
> the middleware.  You want to read 2k of the input to make a middleware
> decision before invoking the next layer down?  Then read it, and pass
> along a fresh iterator that first yields that 2k, then starts yielding
> everything from the partially-read iterator.  Or, you can pass along a
> filter iterator that scans or changes the entire stream as it reads it
> from the upstream iterator.
> 
> But, having thought more about the idea, I think that your criticisms,
> Graham, are exactly on-target.  Iterators don't give enough control to
> the reader to ask about the chunks (lines or blocks) that get delivered
> as they read.  So at the very least we should indeed be looking at a
> file-like object;

Hmmmm. Graham brought up chunked requests, which I don't think have much
bearing on this issue--the server/app can't rely on the client-specified
chunk sizes either way (or you enable a Denial of Service attack). I
don't see much difference between the file approach and the iterator
approach, other than moving the read chunk size from the app (or more
likely, the cgi module) to the server. That may be what kills this
proposal: cgi.FieldStorage expects a file pointer and I doubt we want to
either rewrite the entire cgi module to support iterators, or re-package
the iterator up as a file.
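
For reference, re-packaging an iterator as a file isn't much code. A
rough sketch (untested; the IterFile name is made up, and readline()
would also have to be implemented before cgi.FieldStorage could use it):

    class IterFile(object):
        """Minimal read-only file-like wrapper over an iterator of strings."""
        def __init__(self, iterable):
            self._iter = iter(iterable)
            self._buf = ''

        def read(self, size=-1):
            # Pull blocks from the iterator until the request can be met.
            while size < 0 or len(self._buf) < size:
                try:
                    self._buf += next(self._iter)
                except StopIteration:
                    break
            if size < 0:
                data, self._buf = self._buf, ''
            else:
                data, self._buf = self._buf[:size], self._buf[size:]
            return data

Not hard, but it's pure adapter overhead if the server could have handed
us a file-like object in the first place.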

> it's still easy to construct a file-like object that's
> really streaming from another file as it comes in, and we could even
> provide shortcuts to build files from inline iterators or something.

Right; either approach can be re-streamed pretty easily.
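
E.g. a middleware component that peeks at the first 2KB and then passes
the whole body downstream needs only something like this (again just a
sketch; the prefix and block sizes are arbitrary, and IterFile is the
hypothetical wrapper above):

    def restream(prefix, stream, blocksize=8192):
        # Replay the bytes we already consumed for our own decision,
        # then hand through the rest of the body as it arrives.
        yield prefix
        while True:
            block = stream.read(blocksize)
            if not block:
                break
            yield block

    prefix = environ['wsgi.input'].read(2048)
    environ['wsgi.input'] = IterFile(restream(prefix, environ['wsgi.input']))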

> And, the idea that each piece of middleware does its *own* buffering
> might be a bad one too.  One might naively store everything in RAM,
> another might put blocks on disk, another might run you out of /tmp
> space trying to do the same thing - even storing duplicates of the same
> data if we're not careful!  The same 1MB initial block could wind up on
> disk two or three times if each piece of middleware thinks it's the one
> with it cached to pass along to the bottom layer that's reading 16k
> blocks at a time.

Any middleware which did so would pretty quickly get fixed or abandoned.
I don't think that's a strong argument, given how many developers already
have experience in this area from existing middleware.

> So what's left of my suggestion?  I suggest that we *not* commit to
> unlimited rewinding of the input stream; that was my single real
> insight, and an uncontrollable iterator design gives up too much in
> order to achieve that.  A file-like object is more appropriate, but we
> either need to make middleware do its own caching of partially-consumed
> data, *or* we need some way for middleware to signal whether it needs
> the older data kept.
> 
> Could "input.bookmark()" signal that everything from this point on in
> the stream needs to be retained, in memory or on disk, to be rewound
to
> later?  And the data be released only after the bookmark is deleted?
> 
>    b = input.bookmark()
>    input.read()...
> 
>    input2 = b.file()
>    del b
> 
> Or, we could allow the "input" object to support cloning, where all data
> is cached from the clone-that's-read-least-far to the one that's read
> the farthest:
> 
>    c = input.clone()
>    input.read(100)
>    # 100 bytes are now cached by the framework, in RAM or on disk or on
>    # a USB keyfob or wherever this framework puts it. (Django will write
>    # their own caching that's different from everyone else's).
>    c.read(100)
>    # the bytes are released
>    del c
>    # Now that there's just one active clone, no buffering takes place.
> 
> That way one could "read ahead" on your own input, while passing the
> complete stream back down to the next level.  This has the disadvantage
> that if a middleware piece wants to keep the first 100MB and last 100MB
> from a stream but throw out the middle, it's got no way to do so without
> dropping back to its own caching scheme that the framework can't
> coordinate with other schemes; but it seems to cover the majority of
> cases that I can think of.

Those seem like strategies for individual middleware components to
implement; there's no need to burden the general case with them.
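
For instance, a component that wants the "first 100MB and last 100MB"
behavior can already get it with a private wrapper and no help from the
spec. A sketch only (the class name is made up, and readline() etc. are
omitted):

    class HeadTailCache(object):
        """Pass the body through unchanged, privately keeping only the
        first and last `limit` bytes for this component's own use."""
        def __init__(self, stream, limit):
            self.stream = stream
            self.limit = limit
            self.head = ''
            self.tail = ''

        def read(self, size=-1):
            data = self.stream.read(size)
            if len(self.head) < self.limit:
                self.head += data[:self.limit - len(self.head)]
            self.tail = (self.tail + data)[-self.limit:]
            return data

    # environ['wsgi.input'] = HeadTailCache(environ['wsgi.input'],
    #                                       100 * 1024 * 1024)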

> Anyway: no unlimited caching, no unlimited rewind; that's my argument.
> Iterators were just one way of cleanly getting there, but probably, in
> the light of the next day, not a powerful enough way.

I'd vote to stick with the file-like approach for no other reason than
that FieldStorage expects one.
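
(For the record, the pattern most frameworks use today is essentially
just:

    import cgi
    form = cgi.FieldStorage(fp=environ['wsgi.input'], environ=environ)

and whatever we standardize has to keep that working, or ship an adapter
for it.)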


Robert Brewer
fumanchu at aminus.org


