[Web-SIG] Proposal: Avoiding Serialization When Stacking Middleware

Wed Mar 7 04:43:43 CET 2007

Phillip J. Eby wrote:
> At 08:08 PM 3/6/2007 -0600, Ian Bicking wrote:
>> Posted here: http://wsgi.org/wsgi/Specifications/avoiding_serialization
>>
>> Text copied below for discussion:
>>
>>
>> :Title: Avoiding Serialization When Stacking Middleware
>> :Author: Ian Bicking <ianb at colorstudy.com>
>> :Discussions-To: Python Web-SIG <web-sig at python.org>
>> :Status: Proposed
>> :Created: 06-03-2007
>>
>> .. contents::
>>
>> Abstract
>> --------
>>
>> This proposal gives a strategy for avoiding unnecessary serialization
>> and deserialization of request and response bodies.  It does so by
>> attaching attributes to ``wsgi.input`` and the ``app_iter``, as well as
>> a new environment key ``x-wsgiorg.want_parsed_response``.
>>
>> Rationale
>> ---------
>>
>> Output-transforming middleware often has to parse the upstream content,
>> transform it, then serialize it back to a string for output.  The
>> original output may have already been in the parsed form that the
>> middleware wanted.  Or there may be more middleware that does similar
>> transformations on the same kind of objects.
> 
> HTTP already includes a mechanism for specifying what types are accepted 
> by a content consumer: the "Accept" header.  You can always add other 
> values to it to indicate the parsed values you can accept.
> 
> Of course, this doesn't really work well with WSGI - you want the result 
> to actually *be* WSGI...  so you can use the WSGI way of doing this, 
> which is to have a standard wrapper for the specific content type you 
> want to use.

Yeah, using Accept is clever, but not really accurate, since if you 
serialize the WSGI request to HTTP the addition no longer makes sense.

> The wrapper (as with the wsgi "file wrapper") simply puts a WSGI face on 
> a non-WSGI result body, converting it to an iterator of strings, and 
> holding other attributes known to the middleware or other application 
> object.

That just calls for a series of ad hoc techniques, basically, where each 
object type results in a new key in the environment and a new ad hoc 
specification to be made (e.g., wsgi.file_wrapper takes a block size, 
which is specific only to that case).

> This could be implemented as an environ key containing a mapping from 
> types to wrapper functions.  Middleware that wants a type just copies 
> the mapping and overwrites any entries it cares about.  Applications 
> that want to return a non-serialized result just look up the type (using 
> __mro__ order) to find an applicable wrapper.

OK, the dict would avoid multiple different kinds of keys, and 
presumably they'd all have the same signature.  Block size doesn't 
really make any sense to me as a common parameter.  Content type should 
be a common parameter, as something like an lxml object can be 
serialized as either XML or HTML.  I don't think any response headers 
are likely to affect the serialization... though with my specification 
that remains an application concern, so it doesn't have to be resolved 
in the specification.

I hadn't really thought about MRO, though generally I don't trust 
inheritance to be meaningful anyway -- I feel like I'd be more likely to 
switch on the type than test inheritance.
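
For concreteness, here is a rough sketch of the mapping-plus-MRO lookup 
being discussed.  Everything named here is an assumption made for 
illustration: the environ key "x-example.wrappers", the ParsedBody 
class, and the (obj, content_type) wrapper signature are not part of any 
specification, and JSON stands in for whatever serialization a real 
wrapper would do.

```python
import json
from collections import OrderedDict

WRAPPERS_KEY = "x-example.wrappers"  # hypothetical environ key


class ParsedBody:
    """WSGI-compatible body that also carries the parsed object."""

    def __init__(self, parsed, serializer, content_type):
        self.parsed = parsed              # aware middleware reads this
        self.content_type = content_type
        self._serializer = serializer

    def __iter__(self):
        # Serialization only happens if somebody iterates the body.
        yield self._serializer(self.parsed)


def wrap_json(obj, content_type="application/json"):
    """Example wrapper: every wrapper shares this assumed signature."""
    return ParsedBody(obj, lambda o: json.dumps(o).encode("utf-8"),
                      content_type)


def find_wrapper(environ, obj):
    """Return the first wrapper registered for a class in obj's MRO."""
    wrappers = environ.get(WRAPPERS_KEY, {})
    for cls in type(obj).__mro__:
        if cls in wrappers:
            return wrappers[cls]
    return None


# Middleware would copy the mapping and add the types it cares about.
environ = {WRAPPERS_KEY: {dict: wrap_json}}
```

An OrderedDict result finds the wrapper registered for dict because of 
the MRO walk; switching on the exact type instead would mean replacing 
the loop with a single wrappers.get(type(obj)).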

> Notice that this approach doesn't require any special protocol for these 
> wrappers -- just WSGI.  It's simpler to specify, and simpler to 
> implement than what you propose, while addressing some of the open issues.

The specification isn't particularly long or complicated, IMHO.  The 
implementation is complicated mostly for reasons unrelated to the 
specification -- any output-transforming middleware will be similarly 
complicated.

> Yes, it does have some problems with interface vs. implementation.  ISTM 
> that trying to solve that problem is effectively asking to revive or 
> reinvent PEP 246, however.  But we could explicitly allow the use of 
> type names instead of the actual types.

When playing with implementation I used type names, and actually I 
rather prefer them, but it's not always clear what name to use.  For 
instance, "lxml", "lxml.etree", "lxml.etree.Element", and 
"lxml.etree._Element" all are reasonable names.  Or "ElementTree", 
"ElementTree.Element", "ElementTree._Element", "xml.etree", 
"xml.etree.Element", and "xml.etree._Element".  Or even something like 
"IElement" could make sense in some context (e.g., what if you can 
accept the overlapping interfaces of both lxml and ElementTree?)

At least the actual type object seems easy enough.  OTOH, there are 
actually cases when I'd like to say that I could accept a certain type 
without having to import the type.  E.g., if I wanted to do an XSLT 
transformation, I *could* support several kinds of objects without 
requiring any of them (e.g., lxml, 4DOM, and Genshi Markup).
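
One way to sketch the "accept a type without importing it" idea is to 
compare against names derived from the object's class.  The convention 
used below (module name plus qualified class name, checked along the 
MRO) is only one of the candidate namings listed above, chosen here as 
an assumption; the helper names are hypothetical.

```python
def type_names(obj):
    """Yield 'module.QualName' for each class in the object's MRO."""
    for cls in type(obj).__mro__:
        yield f"{cls.__module__}.{cls.__qualname__}"


def accepted_name(obj, accepted):
    """Return the first MRO name that appears in `accepted`, else None."""
    for name in type_names(obj):
        if name in accepted:
            return name
    return None


# A middleware can list names for libraries it never imports.  These
# particular strings are illustrative, not an agreed-upon registry.
ACCEPTED = {"lxml.etree._Element", "xml.etree.ElementTree.Element"}
```

This does nothing to settle which name is the "right" one for a given 
library, and it cannot express something like "IElement" unless the 
producer explicitly advertises that name some other way.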

>> The same things apply to the parsing of ``wsgi.input``, specifically
>> parsing form data.  A similar strategy is presented to avoid
>> unnecessarily reparsing that data.
> 
> I would rather offer an optional 'get_file_storage()' method or some 
> such as a blessed WSGI extension, than have such an open-ended "get 
> whatever you want from the input object" concept floating around.  A 
> strategy which reinvents half of PEP 246 (the *old* PEP 246, before it 
> became almost as complicated as WSGI) seems like overkill to me.

I don't really understand what you are proposing.  This part addresses 
the same issues as presented in 
http://wsgi.org/wsgi/Specifications/handling_post_forms

I really don't *want* to write every wsgi.input to a temporary file just 
because someone else *might* want to reparse the input.  I'd much rather 
do it lazily, as 99% of the time reparsing won't happen.
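
A minimal sketch of what "lazily" could look like, assuming 
URL-encoded form data.  The parsed_form attribute and both helper names 
are hypothetical, not what either specification page defines.  The raw 
body is rebuilt only if some unaware consumer calls read().

```python
import io
from urllib.parse import parse_qsl, urlencode


class LazyFormInput:
    """wsgi.input replacement that keeps the parsed form around."""

    def __init__(self, parsed_form):
        self.parsed_form = parsed_form  # list of (name, value) pairs
        self._raw = None                # built on the first read()

    def _stream(self):
        if self._raw is None:
            data = urlencode(self.parsed_form).encode("ascii")
            self._raw = io.BytesIO(data)
        return self._raw

    def read(self, size=-1):
        return self._stream().read(size)


def get_form(environ):
    """Return the parsed form, parsing wsgi.input only when needed."""
    stream = environ["wsgi.input"]
    parsed = getattr(stream, "parsed_form", None)
    if parsed is None:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        parsed = parse_qsl(stream.read(length).decode("ascii"))
        # Leave a re-readable stand-in behind for later consumers.
        environ["wsgi.input"] = LazyFormInput(parsed)
    return parsed
```

The sketch skips things a real version would need: the re-serialized 
body may differ in length from the original, so CONTENT_LENGTH would 
have to be kept in sync, and multipart bodies and file uploads are not 
handled at all.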

>> Obviously the code is not simple, but this is the nature of WSGI
>> output-transforming middleware.
> 
> Something I'd like to fix in WSGI 2.0, by getting rid of both 
> "start_response" and "write", but that's a discussion for another time.

Yeah, that'd be nice, but another discussion for another time.

>> Other Possibilities
>> -------------------
>>
>> * You could simply parse everything every time.
>> * You could pass data through callbacks in the environment (but this can
>> break non-aware middleware).
>> * You can make custom methods and keys for each case.
>> * You can use something other than WSGI.
> 
> And you can use the established WSGI method for adding semantics to a 
> response, using a middleware-supplied wrapper.  I think this is actually 
> the best alternative.

I really don't understand the advantage.

> In truth, it could be as simple as using the class's fully-qualified 
> name as an environ key (perhaps with a prefix or suffix), with the value 
> being a wrapper for objects implementing that protocol.  No 
> x-foobar-wsgiorg-whatchamacallit cruft needed.
> 
> And, it's lightweight enough of a concept to be expressed as a simple 
> "best practice" design pattern.

Best practice is fine, though of course it still needs to be documented, as 
this is hardly a practice that people would naturally think about or 
implement.  But I don't really think that practice would be any simpler 
or easier to describe if done completely.  In fact, I think it would 
take exactly the same amount of space to describe.

-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org