[Web-SIG] Proposal to remove SCRIPT_NAME/PATH_INFO

Wed Sep 23 22:21:11 CEST 2009

On Wed, Sep 23, 2009 at 2:38 PM, P.J. Eby <pje at telecommunity.com> wrote:

> At 08:42 AM 9/23/2009 +0200, Armin Ronacher wrote:
>
>> > I then propose that we eliminate SCRIPT_NAME and PATH_INFO.  Instead we
>> > have:
>> IMO they should stick around for compatibility with older applications
>> and be latin1 encoded on Python 3.  But the use is discouraged.
>>
>
> One or the other should be there, not both.  If you allow older code to
> work, this means it could change the old ones but not the new, leaving a
> confused mess for child applications to sort out.

This is my strongly-held opinion as well.  It's been a struggle to get
people to provide accurate SCRIPT_NAMEs, and to represent the idea of
SCRIPT_NAME through SCRIPT_NAME (as opposed to a hodge-podge of different
patterns, configuration, etc).  Providing this information twice would be a
big step backwards: it invites all sorts of weird bugs and inconsistent
behavior whenever the two are out of sync, with the outcome depending on
which key a given piece of code prefers.

I *wish* SCRIPT_NAME and PATH_INFO had been strictly required in WSGI 1
(they are in CGI, but not WSGI).  If they were, we'd see more of
environ['PATH_INFO'], which would break fast and obviously, and less
environ.get('PATH_INFO', '').  But... too late for that now.  The new key
should definitely be required.  Then code can even do:

if 'wsgi.path_info' in environ:
    path_info = urllib.unquote(environ['wsgi.path_info'])
else:
    path_info = environ.get('PATH_INFO', '')

We should also make sure the new validator works on both versions of WSGI,
which will make it easier to backport checks like making sure wsgi.path_info
is *not* in a WSGI 1 environ.
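
A rough sketch of what such a check might look like.  It assumes WSGI 2
would announce itself as wsgi.version == (2, 0), which nothing in this
thread settles, and check_path_keys is a made-up name, not an existing
wsgiref function:

```python
def check_path_keys(environ):
    """Raise AssertionError if the path keys don't match the WSGI version."""
    # Assumption: a WSGI 2 server sets wsgi.version to (2, 0) or later.
    version = environ.get('wsgi.version', (1, 0))
    old_keys = ('SCRIPT_NAME', 'PATH_INFO')
    new_keys = ('wsgi.script_name', 'wsgi.path_info')
    if version >= (2, 0):
        # New keys are required; the old ones must be gone ("not both").
        for key in new_keys:
            if key not in environ:
                raise AssertionError("%s is required in WSGI 2" % key)
        for key in old_keys:
            if key in environ:
                raise AssertionError("%s must not appear in WSGI 2" % key)
    else:
        for key in new_keys:
            if key in environ:
                raise AssertionError("%s must not appear in WSGI 1" % key)
```

The same function runs against either kind of environ, which is the point:
one validator, both versions.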

Not directly in response to this email: several people have expressed
concern that some environments provide only the unquoted path.  I don't
think it's terrible if they fake it by re-quoting the path.  In CGI/Python 3
this would be something like:

environ['wsgi.script_name'] = urllib.request.quote(
    os.environ['SCRIPT_NAME'].encode(sys.getdefaultencoding(),
                                     'surrogateescape'))

(obviously urllib.request.quote needs to be fixed to work on bytes; though
the implementation is also small enough we could show the correct
implementation in the spec, and warn implementors not to trust
urllib.request.quote to work in Python 3.0-3.1.1)
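
To give an idea of how small a bytes-safe quote is, here is a minimal
sketch.  The name quote_bytes and the choice of safe characters (the RFC
3986 unreserved set plus "/") are my assumptions, not proposed spec text:

```python
# Bytes left unescaped: RFC 3986 unreserved characters plus "/" as the
# path separator.  Iterating over bytes yields ints, so this is a set of ints.
_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~/"
)

def quote_bytes(raw, safe=_SAFE):
    """Percent-encode a bytes path and return it as a native str."""
    return ''.join(
        chr(b) if b in safe else '%%%02X' % b
        for b in raw
    )
```

Since it works on raw bytes, it never has to guess an encoding; a
surrogate-escaped environment string only needs to be encoded back to bytes
before it is passed in.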

I also believe you can safely reconstruct the real SCRIPT_NAME/PATH_INFO
from REQUEST_URI, which is usually available (at least in contexts where
this sort of thing is a problem).  I am not up to thinking it through right
now, as it's not a trivial algorithm, but I'm sure it can be done.  Really
it's just a question of how much you can avoid brute force, because you
could always do:

def real_path(request_uri, script_name, path_info):
    # Try every prefix of request_uri, from empty to the whole string
    for i in range(len(request_uri) + 1):
        if urllib.request.unquote(request_uri[:i]) == script_name:
            return request_uri[:i], request_uri[i:]
    # Something is messed up, fake it
    return (urllib.request.quote(script_name),
            urllib.request.quote(path_info))

I think you could do better than going character-by-character (working by
path segment instead), and in particular do it faster when %2f doesn't appear
in the path at all (the common case).  This would be appropriate code for
wsgiref.
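
A sketch of the segment-wise idea, not thought through for every server.
It assumes the SCRIPT_NAME/PATH_INFO boundary always falls on a literal "/"
or at the end of the path, that REQUEST_URI may carry a query string which
should be dropped, and that the unquoted values are UTF-8 text;
real_path_by_segment is a made-up name:

```python
from urllib.parse import quote, unquote

def real_path_by_segment(request_uri, script_name, path_info):
    """Return quoted (script_name, path_info) recovered from REQUEST_URI."""
    # Assumption: anything after "?" is the query string, not path.
    path = request_uri.split('?', 1)[0]
    # Assumption: the boundary is at a literal "/" or at the end of the path.
    cuts = [i for i, ch in enumerate(path) if ch == '/'] + [len(path)]
    if '%2f' not in path.lower():
        # Common case: slashes map one-to-one between the quoted and
        # unquoted forms, so only one cut point needs checking.
        n = script_name.count('/')
        cuts = cuts[n:n + 1] or cuts[-1:]
    for i in cuts:
        if unquote(path[:i]) == script_name:
            return path[:i], path[i:]
    # Something is messed up, fake it.
    return quote(script_name), quote(path_info)
```

With this, real_path_by_segment('/a%2Fb/c', '/a/b', '/c') gives
('/a%2Fb', '/c'), and a REQUEST_URI that doesn't match SCRIPT_NAME at all
falls through to the re-quoted values.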

>> If we go about dropping start_response, can we move the app iter to the
>> beginning?  That would be consistent with the signature of common
>> response objects, making it possible to do this:
>>
>>    response = Response(*hello_world(environ))
>>
>
> When you say "beginning", do you mean the beginning of the return tuple?
>  That is:
>
>    return ['body here'], '200 OK', [('Header', 'value')]
>
> I'd be surprised if a lot of response objects had such a signature, since
> that's not the order a server would actually output that data in.

It'd be more reasonable to change the Response __init__ signature, like:

class Response(object):
    def __init__(self, body_or_wsgi_response, status=None, headers=None):
        if isinstance(body_or_wsgi_response, tuple):
            status, headers, body = body_or_wsgi_response
        else:
            body = body_or_wsgi_response

If you allow an iterator for a body argument, it could be a tuple; but at
least WebOb doesn't allow iterators, only str/unicode.  (You can give an
iterator, but you need to do it with an app_iter keyword argument.)  I don't
know what Werkzeug or other frameworks allow.
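
Filled out a little, that might look like the following.  This is only an
illustration: the attribute names, the '200 OK' default, and the
hello_world app returning a (status, headers, body) tuple are assumptions,
not WebOb's actual API:

```python
class Response(object):
    def __init__(self, body_or_wsgi_response, status=None, headers=None):
        if isinstance(body_or_wsgi_response, tuple):
            # A whole (status, headers, body) return value was passed in.
            status, headers, body = body_or_wsgi_response
        else:
            body = body_or_wsgi_response
        # Hypothetical defaults, for the sake of a runnable example.
        self.status = status or '200 OK'
        self.headers = list(headers or [])
        self.body = body

def hello_world(environ):
    # Hypothetical app using the tuple return value discussed above.
    return '200 OK', [('Content-Type', 'text/plain')], ['Hello world!']

# No argument unpacking needed: the tuple is passed as-is.
response = Response(hello_world({}))
```

The tuple keeps the order a server would emit it in, and the common
Response('some body') call keeps working.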

>> In general I think doing too many changes at once is harmful
>>
>
> Actually, the reverse is true for standards.  Incremental change means more
> versions, which goes counter to the point of having a standard in the first
> place.

Yeah; WSGI 1.1 is just errata, I expect to change very little code.  I'd
rather make just one change to WSGI 2.  And it doesn't seem so hard really.

-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker