[Web-SIG] WSGI 2: Decoding the Request-URI

Robert Brewer fumanchu at aminus.org
Mon Aug 17 05:06:03 CEST 2009


I wrote:
> PATH_INFO and QUERY_STRING are ... decoded via a configurable
> charset, defaulting to UTF-8. If the path cannot be decoded
> with that charset, ISO-8859-1 is tried. Whichever is successful
> is stored at environ['REQUEST_URI_ENCODING'] so middleware and
> apps can transcode if needed.
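The scheme quoted above could be sketched roughly as follows. The function name is illustrative, and `REQUEST_URI_ENCODING` is the proposed (not yet standardized) environ key:

```python
def decode_request_uri(raw_path, environ, charset="utf-8"):
    """Decode a percent-decoded path (bytes) with the configured
    charset, falling back to ISO-8859-1, and record which one won."""
    try:
        decoded = raw_path.decode(charset)
        environ["REQUEST_URI_ENCODING"] = charset
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte 0x00-0xFF to a codepoint,
        # so this branch cannot fail.
        decoded = raw_path.decode("iso-8859-1")
        environ["REQUEST_URI_ENCODING"] = "iso-8859-1"
    return decoded
```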

and Ian replied:
> My understanding is that PATH_INFO *should* be UTF-8 regardless of
> what encoding a page might be in.  At least that's what I got when
> testing Firefox.  It might not be valid UTF-8 if it was manually
> constructed, but then there's little reason to think it is valid...

Actually, current browsers tend to use UTF-8 for the path, and either the encoding of the document [1] or Windows-1252 [2] for the query string. But the vast majority of HTTP user agents are not browsers [3]. Even if that were not so, we should not define WSGI to only interoperate with the most current browsers.

and Graham added:
> Thinking about it for a while, I get the feel that having a fallback
> to latin-1 if string cannot be decoded as UTF-8 is a bad idea. That
> URLs wouldn't consistently use the same encoding all the time just
> seems wrong. I would see it as returning a bad request status. If an
> application coder knows they are actually going to be dealing with
> latin-1, as that is how the application is written, then they should
> be specifying it should be latin-1 always instead of utf-8. Thus, the
> WSGI adapter should provide a means to override what encoding is used.

Applications do produce URI's (and IRI's, etc. that need to be converted into URI's) and do transfer them in media types like HTML, which define how to encode a.href's and form.action's before %-encoding them [4]. But these are not the only vectors by which clients obtain or generate Request-URI's.

> For simple WSGI adapters which only service one WSGI application, then
> it would apply to whole URL namespace.

As someone (Alan Kennedy?) noted at PyCon, static resources may depend upon a filename encoding defined by the OS which is different than that of the rest of the URI's generated/understood by even the most coherent application.

The encoding used for a URI is only really important for one reason: URI comparison. Comparison is at the heart of handler dispatch, static resource identification, and proper HTTP cache operation. It is for these reasons that RFC 3986 has an extensive section on the matter [5], including a "ladder" of approaches:

 * Simple String Comparison
 * Case Normalization (e.g. /a%3D == /a%3d)
 * Percent-Encoding Normalization (e.g. /a%62c == /abc)
 * Path Segment Normalization (e.g. /abc/../def == /def)
 * Scheme-Based Normalization (e.g. http://example.com == http://example.com:80/)
 * Protocol-Based Normalization (e.g. /data == /data/ if previous dereferencing has shown them to be equivalent)
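A few rungs of that ladder can be sketched in Python. This is only an illustration of the RFC 3986 rules, not a complete or spec-exact normalizer (real servers would want to work on the raw bytes, before any charset decoding):

```python
from urllib.parse import quote, unquote

def normalize_path(path):
    """Apply case, percent-encoding, and path-segment normalization
    to a URI path, per the RFC 3986 "ladder" (sketch only)."""
    # Percent-encoding normalization: decode all escapes, then re-encode
    # only characters that must be escaped. As a side effect, quote()
    # emits uppercase hex digits, giving case normalization of escapes.
    path = quote(unquote(path), safe="/")
    # Path segment normalization: resolve "." and ".." segments.
    segments = []
    for seg in path.split("/"):
        if seg == ".":
            continue
        elif seg == "..":
            if segments and segments[-1]:
                segments.pop()
        else:
            segments.append(seg)
    return "/".join(segments)
```

With this, `/a%3d` normalizes to `/a%3D`, `/a%62c` to `/abc`, and `/abc/../def` to `/def`, matching the examples above.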

I think it would be beneficial to those who develop WSGI application interfaces to be able to assume that at least case-, percent-, path-, and scheme-normalization are consistently performed on SCRIPT_NAME and PATH_INFO by all WSGI 2 origin servers.

All of those except for the first one can be accomplished without decoding the target URI. But that first section specifically states: "In practical terms, character-by-character comparisons should be done codepoint-by-codepoint after conversion to a common character encoding." In other words, the URI spec seems to imply that the two URI's "/a%c3%bf" and "/a%ff" may be equivalent, if the former is u"/a\u00FF" encoded in UTF-8 and the latter is u"/a\u00FF" encoded in ISO-8859-1. Note that WSGI 1.0 cannot speak about this, since all environ values must be byte strings. IMO WSGI 2 should do better in this regard.
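To make the equivalence above concrete, here is the comparison spelled out (`decode_target` is just an illustrative helper):

```python
from urllib.parse import unquote_to_bytes

def decode_target(pct_path, charset):
    """Percent-decode a request target to bytes, then decode those
    bytes with the given charset."""
    return unquote_to_bytes(pct_path).decode(charset)

# Byte-for-byte the two request targets differ, but compared
# codepoint-by-codepoint, after each is decoded with the charset its
# producer used, they are the same string u"/a\u00FF":
assert decode_target("/a%c3%bf", "utf-8") == decode_target("/a%ff", "iso-8859-1")
```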

> For something like Apache where [it] could map to multiple WSGI
> applications, then it may want to provide means of overriding encoding
> for specific subsets of URLs, i.e., using Location directive for
> example.

For the three reasons above, I don't think we can assume that the application will always receive equivalent URI's encoded in a single, foreseen encoding. Yet we still haven't answered the question of how to handle unforeseen encodings. You're right that, if the server-side stack as a whole cannot map a particular URI to a handler, it should respond with a 4xx code. I'd prefer 404 over 400, but either is fine.

However, we quite often use only a portion of the URI when attempting to locate an appropriate handler; sometimes just the leading "/" character! The remaining characters are often passed as function arguments to the handler, or stuck in some parameter list/dict. In many cases, the charset used to decode these values either: is unimportant; follows complex rules from one resource to another; or is merely reencoded, since the application really does care about bytes and not characters. Falling back to ISO-8859-1 (and minting a new WSGI environ entry to declare the charset which was used to decode) can handle all of these cases. Server configuration options cannot, at least not without their specification becoming unwieldy.
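For the "merely reencoded" case, a piece of middleware could recover the original bytes using the proposed environ entry. This is a hypothetical sketch; the `REQUEST_URI_ENCODING` and `wsgi.path_bytes` keys are assumptions for illustration, not part of any spec:

```python
class BytesPathMiddleware:
    """Expose the raw path bytes to an app that cares about bytes,
    not characters, by undoing whatever decode the server performed."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        charset = environ.get("REQUEST_URI_ENCODING", "utf-8")
        # Re-encoding with the declared charset is lossless, since that
        # charset is the one the server successfully decoded with.
        environ["wsgi.path_bytes"] = environ["PATH_INFO"].encode(charset)
        return self.app(environ, start_response)
```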


Robert Brewer
fumanchu at aminus.org

[1] http://markmail.org/message/r6qzszybsk5pwzbt
[2] http://markmail.org/message/47cekkpvdjaectvi
[3] http://markmail.org/message/3bsxo7q6eztcp3yo
[4] http://www.w3.org/TR/html4/interact/forms.html#idx-character_encoding
[5] http://tools.ietf.org/html/rfc3986#section-6
