[Web-SIG] WSGI 2.0 Round 2: requirements and call for interest

Tue Jan 5 17:19:15 EST 2016

> On 6 Jan 2016, at 12:09 AM, chris.dent at gmail.com wrote:
> 
> As someone who writes their WSGI applications as functions that take
> `start_response` and `environ` and doesn't bother with much
> framework the things I would like to see in a minor revision to WSGI
> are:
> 
> * A consistent way to access the raw un-decoded request URI. This is
>  so I can reconstruct a realistic `PATH_INFO` that has not been
>  subjected to destructive handling by the server (e.g. apache
>  messing with `%2F`) before continuing on to a route dispatcher.

This is already available in some servers by way of the REQUEST_URI value.

This comes from the original first line (the request line) of any HTTP request and can be split apart to get the path.

The problem is that you cannot easily use it unless you want to replicate normalisations that the underlying server may do.

The key problem is working out where SCRIPT_NAME ends and PATH_INFO starts within the original path given in REQUEST_URI.

Sure, if you only deal with a web application mounted at the root of the host it is easier, because SCRIPT_NAME would be empty, but when mounted at a sub URL it gets trickier.

This is because a web server will eliminate things like repeating slashes in the part of the path that may match the mount point (sub URL) for the web application. The sub URL here could be dictated by what is defined in a configuration file, or could instead be due to matching against a file system path.

Further, the web server will eliminate attempts at relative directory traversal using ‘..’ and ‘.’.

So an original path may be something like:

    /a/b//c/../d/.//e/../f/g/h

If the mount point was ‘/a/b/d’, then that is what gets passed through SCRIPT_NAME.
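
As a rough illustration of that normalisation, here is a minimal sketch. It assumes posixpath.normpath() is a fair stand-in for what a server does (collapse repeated slashes, resolve ‘.’ and ‘..’). Real servers apply their own rules, and percent-decoding is ignored here.

```python
import posixpath

# Example raw path and mount point from the text above.
raw_path = "/a/b//c/../d/.//e/../f/g/h"
mount_point = "/a/b/d"

# Assumption: normpath() approximates the server's normalisation.
normalised = posixpath.normpath(raw_path)

# After normalisation the mount point is a plain prefix of the path.
script_name = mount_point
path_info = normalised[len(mount_point):]

print(normalised)   # /a/b/d/f/g/h
print(script_name)  # /a/b/d
print(path_info)    # /f/g/h
```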

Now if you instead go to the raw path, you would need to replicate all the normalisations. Only then could you try to calculate where SCRIPT_NAME ended and PATH_INFO started in the raw path, perhaps based on the length of SCRIPT_NAME, the number of component parts, or the actual components in the path.
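
To give a feel for what that involves, here is a sketch of one possible approach: walk the raw path a segment at a time, track the normalised components seen so far, and stop at the first point where they equal SCRIPT_NAME. The function name split_raw_path() is made up for this example. It does not handle percent-encoding, and it does not check whether a later ‘..’ would undo the match, so treat it as an illustration only.

```python
def split_raw_path(raw_path, script_name):
    """Split raw_path at the first offset where its normalised prefix
    equals script_name. Returns (raw_script, raw_rest) or None."""
    target = [p for p in script_name.split("/") if p]
    stack = []  # normalised components seen so far
    pos = 0
    n = len(raw_path)
    if stack == target:
        return "", raw_path  # mounted at the root
    while pos < n:
        # Skip one or more slashes, then read up to the next slash.
        while pos < n and raw_path[pos] == "/":
            pos += 1
        end = pos
        while end < n and raw_path[end] != "/":
            end += 1
        segment = raw_path[pos:end]
        pos = end
        if segment in ("", "."):
            pass  # contributes nothing after normalisation
        elif segment == "..":
            if stack:
                stack.pop()
        else:
            stack.append(segment)
        if stack == target:
            return raw_path[:pos], raw_path[pos:]
    return None

print(split_raw_path("/a/b//c/../d/.//e/../f/g/h", "/a/b/d"))
# ('/a/b//c/../d', '/.//e/../f/g/h')
```

Note that the remainder still starts with ‘/.//e/..’, which normalises away to nothing, so even here it is ambiguous where SCRIPT_NAME really ends.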

This will still all fail if a web server does internal rewrites though, as the final SCRIPT_NAME may not even match the raw path. At that point URL reconstruction can be a problem as well, if what the application is given by way of the rewrite isn’t a public path.

I have only looked at SCRIPT_NAME here. Servers will apply the same sort of normalisations to PATH_INFO as well.

So even this isn’t so simple to do properly if you want to go back and do it yourself using the raw path.

I have never seen anyone who tries to extract repeating slashes intact out of a raw path even attempt to do it properly. They tend to assume that the raw path is pure, that it has nothing in it which needs to be normalised, and that rewrites aren’t occurring. As a result they assume that they can just strip a number of characters off the raw path based on the length of the SCRIPT_NAME passed through. This will be fragile though if the raw path isn’t pure.
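
A small sketch of that naive approach, using a made-up helper name naive_path_info(), shows how it works for a pure raw path and goes wrong for the example path given earlier:

```python
def naive_path_info(raw_path, script_name):
    # Assumes the raw path starts with script_name character for
    # character, which only holds if nothing needed normalising.
    return raw_path[len(script_name):]

# Pure raw path: the repeated slash in PATH_INFO is preserved.
print(naive_path_info("/a/b/d/f//g", "/a/b/d"))
# /f//g

# Raw path needing normalisation: the split lands in the wrong place.
print(naive_path_info("/a/b//c/../d/.//e/../f/g/h", "/a/b/d"))
# c/../d/.//e/../f/g/h
```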

Graham