[Web-SIG] WSGI Amendments thoughts: the horror of charsets

Thu Nov 13 00:44:53 CET 2008

FWIW, there was a past discussion of these issues on the mod_wsgi list. I
can't really remember what the outcome was. The discussion is at:

  http://groups.google.com/group/modwsgi/browse_frm/thread/2471a1a71620629f

Graham

2008/11/13 Andrew Clover <and-py at doxdesk.com>:
> It would be lovely if we could allow WSGI applications to reliably accept
> Unicode paths.
>
> That is to say, allow WSGI apps to have beautiful URLs like Wikipedia's,
> without requiring URL-rewriting magic. (Which is so highly server-specific,
> potentially unavailable to non-admin webmasters, and makes WSGI app
> deployment more difficult than it already is.)
>
>
> If we could reliably read the bytes the browser sends to us in the GET
> request, that would be great: we could just decode those and be done with it.
> Unfortunately, that's not reliable, because:
>
> 1. thanks to an old wart in the CGI specification, %XX hex escapes are
> decoded before the character is put into the PATH_INFO environment variable;
>
> 2. the environment variables may be stored as Unicode.
>
> (1) on its own gives us the problem of not being able to distinguish a
> path-separator slash from an encoded %2F; a long-known problem but not one
> that greatly affects most people.
>
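A minimal sketch of point (1), in Python 3. The function cgi_path_info is a hypothetical stand-in for a CGI-style server that unescapes %XX sequences before the application sees the path; it is not any real server's code. It only illustrates that a literal slash and an encoded %2F become indistinguishable once unescaped.

```python
from urllib.parse import unquote


def cgi_path_info(raw_path: str) -> str:
    """Hypothetical model of a CGI server: unescape %XX before handing
    the path to the application as PATH_INFO."""
    return unquote(raw_path)


# Two different request paths: one with a real path separator,
# one with an escaped slash inside a single path segment.
literal = cgi_path_info("/wiki/AC/DC")
escaped = cgi_path_info("/wiki/AC%2FDC")

# After unescaping, the application cannot tell them apart.
same_after_unescape = (literal == escaped)
print(same_after_unescape)
```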
> But combined with (2) that means some other component must choose how to
> decode the bytes into Unicode characters. No standard currently specifies
> what encoding to use, it is not typically configurable, and it's certainly
> not within reach of the WSGI application. My assumption is that most
> applications will want to end up with UTF-8-encoded URLs; other choices are
> certainly possible but as we move towards IRI they become less likely.
>
>
> This situation previously affected only Windows users, because NT
> environment variables are native Unicode. However, Python 3.0 specifies all
> environment variable access is through a Unicode wrapper, and gives no way
> to control how that automatic decoding is done, leaving everyone in the same
> boat.
>
> WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ should
> be "decoded from the headers using HTTP standard encodings (i.e. latin-1 +
> RFC 2047)", but unfortunately this doesn't quite work:
>
> 1. for many existing environments the decoding-from-headers charset is out
> of reach of the WSGI server/layer and may well not be ISO-8859-1. Even
> wsgiref doesn't currently use 8859-1 (see below).
>
> 2. RFC2047 is not applicable to HTTP headers, which are not really
> 822-family headers even though they look just like them. The sub-headers in
> eg. a multipart/form-data chunk *are* (probably) proper 822 headers so
> RFC2047 could apply, but those headers are already dealt with by the
> application or framework, not WSGI. HTTP 1.1 (RFC2616) does refer to RFC2047
> as an encoding mechanism for TEXT and quoted-string, but this makes no sense
> as 2047 itself requires embedding in atom-based parsing sequences which
> those productions are not (quoted-strings are explicitly disallowed by 2047
> itself). In any case no existing browser attempts to support RFC2047
> encoding rules for any possible interpretation of what 2616 might mean.
>
>
> Something like Luís Bruno's ORIGINAL_PATH_INFO proposal
> (http://mail.python.org/pipermail/web-sig/2008-January/003124.html) would be
> worth looking at for this IMO. It may be of questionable usefulness if the
> only character affected is the slash, but it also happens to solve the
> Unicode problem. Obviously whatever it was called it would have to be an
> optional additional value in the WSGI environ, as pure CGI servers wouldn't
> be able to supply it. Conceivably it might also be possible to have a
> standardised mod_rewrite rule to make the variable also available to Apache
> CGI scripts, but still this is far from global availability.
>
> In the meantime I've been looking at how various combinations of servers
> deal with this issue, and in what circumstances an application or middleware
> can safely recover all possible Unicode input. 'Apache' refers to the
> (AFAICT-identical) behaviour of both mod_cgi and mod_wsgi; 'IIS' refers to
> IIS with CGI.
>
>
> *** Apache/Posix/Python2
> OK.
>
> No problem here, it's byte-based all the way through.
>
>
> *** Apache/Posix/Python3
> Dependent on the default encoding.
>
> Apache puts bytes into the envvars but Python takes them out as unicode. If
> the system default encoding happens to be the same as the encoding the WSGI
> application wanted we will be OK. Normally the app will want UTF-8; many
> Linux distributions do use UTF-8 as the default system encoding but there
> are plenty of distros (eg. Debian) and other Unixen that do not. In any case
> we are getting a nasty system dependency at deploy time that many webmasters
> will not be able to resolve.
>
> It is sometimes possible to recover mangled characters despite the wrong
> decoding having been applied. For example if the system encoding was
> ISO-8859-1 or another encoding that maps every byte to a unique Unicode
> character, we can encode the Unicode string back to its original bytes, and
> thence apply the decoding we actually wanted! If, on the other hand, it's
> something like ISO-8859-4, where not all high bytes are mapped at all, we'll
> be losing random characters... not good.
>
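The recovery trick described above can be sketched as follows, assuming the wrong decoding was ISO-8859-1 and the application wanted UTF-8. The helper name recover and the sample path are illustrative only. The "lossy" case uses ASCII with replacement characters purely as a stand-in for any decoder that does not map every byte; it is not meant to model a specific system encoding.

```python
def recover(mangled: str, wrong: str = "iso-8859-1", wanted: str = "utf-8") -> str:
    """Undo a wrong byte-to-Unicode decoding, then apply the wanted one.
    Only works if `wrong` maps every byte to a unique character."""
    return mangled.encode(wrong).decode(wanted)


# Bytes as the browser sent them (UTF-8), then mis-decoded as ISO-8859-1.
original_bytes = "/wiki/Žižek".encode("utf-8")
mangled = original_bytes.decode("iso-8859-1")
recovered = recover(mangled)

# Stand-in for a lossy decoding: every non-ASCII byte becomes U+FFFD,
# so the original bytes can no longer be reconstructed.
lossy = original_bytes.decode("ascii", errors="replace")
lost = lossy.count("\ufffd")
```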
>
> *** Apache/NT/Python2
> Always unrecoverable data loss.
>
> Apache on Windows always uses ISO-8859-1 to decode the request path and put
> it in the Unicode envvars. This is OK so far: we have Unicode characters
> with the same codepoints as the original bytes. However, Python2 needs to
> make the envvars available as bytes. It uses the system default encoding; if
> that were ISO-8859-1, we'd be OK.
>
> But it never is. Western European on NT is actually cp1252, whose characters
> in the range 0x80 to 0x9F differ from ISO-8859-1. And if the app wants
> UTF-8, chances are those characters are going to come up a lot. There is as
> far as I know no user-selectable Windows codepage that can map all the
> Unicode characters up to U+00FF.
>
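A rough illustration of why cp1252 breaks the round trip, written in Python 3 for convenience. It uses Python's strict cp1252 codec as a model of the conversion; the actual Windows conversion may substitute a best-fit character instead of failing, so treat this as a sketch of the data loss rather than a reproduction of the platform's behaviour.

```python
# The euro sign in UTF-8 contains the byte 0x82, which falls in the
# 0x80-0x9F range where cp1252 and ISO-8859-1 disagree.
euro_utf8 = "€".encode("utf-8")              # b'\xe2\x82\xac'

# Model of Apache on NT: bytes decoded to Unicode as ISO-8859-1.
as_unicode = euro_utf8.decode("iso-8859-1")  # U+00E2, U+0082, U+00AC

# Model of Python 2 re-encoding the envvar with a cp1252 system codepage.
try:
    as_unicode.encode("cp1252")
    lossless = True
except UnicodeEncodeError:
    # U+0082 has no cp1252 representation (0x82 means U+201A there).
    lossless = False

# With replacement instead of an error, the middle byte is destroyed.
replaced = as_unicode.encode("cp1252", errors="replace")
```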
>
> *** Apache/NT/Python3
> Wrong, but always recoverable.
>
> Python retrieves the bytes-encoded-into-Unicode-codepoints string directly
> from the envvars. If the encoding should have been UTF-8 or something else
> other than ISO-8859-1, we can recover the original bytes by re-encoding to
> 8859-1, then decoding using the real charset.
>
>
> *** IIS/NT/Python2
> Mostly unrecoverable data loss.
>
> IIS decodes submitted bytes to Unicode using UTF-8 when it can. But if there
> is an invalid UTF-8 sequence in the bytes it will try again using the system
> codepage. Python will then re-encode the Unicode envvar using the system
> codepage.
>
> If the app is expecting UTF-8 we can decode what Python gives us using the
> system codepage (ie. 'mbcs') and get back any of the submitted characters
> that happened to be in this server's system codepage. Other characters may
> be replaced by question marks or Windows's best attempts to give us
> something useful, which at best may be a character shorn of diacriticals and
> at worst something just completely wrong.
>
> NT's system codepage is never UTF-8; it is not even a user-selectable
> option, never mind the default. We can improve our chances of getting more
> characters through by using a character set with a wide repertoire, such as
> cp932 (Shift-JIS). But it's still not really proper Unicode support.
>
> If the app is expecting something non-UTF-8 there's not much hope. Even if
> it wanted the same character set as the system codepage, it can't be sure
> that the submitted bytes didn't happen to also be a valid UTF-8 sequence,
> and thus get mangled by IIS decoding them that way.
>
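The ambiguity in the last paragraph can be sketched with a toy model. The function iis_like_decode is hypothetical: it only mimics the "try UTF-8, fall back to the system codepage" behaviour described above, with cp1252 assumed as the system codepage, and is not derived from IIS itself.

```python
def iis_like_decode(raw: bytes, system_codepage: str = "cp1252") -> str:
    """Toy model: decode as UTF-8 when valid, else use the system codepage."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(system_codepage, errors="replace")


# An app that wants cp1252 receives the two characters "Ã©" as bytes C3 A9.
# Those bytes happen to be valid UTF-8, so the server reads them as "é".
two_chars = iis_like_decode("Ã©".encode("cp1252"))

# A lone byte E9 is invalid UTF-8, so the fallback yields "é" as intended.
one_char = iis_like_decode("é".encode("cp1252"))

# Two different submissions are now indistinguishable to the application.
ambiguous = (two_chars == one_char)
```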
>
> *** IIS/NT/Python3
> OK, as long as the app wants UTF-8.
>
> Incoming UTF-8 bytes are reliably converted to Unicode strings by IIS, and
> directly read by Python from the envvars.
>
> If the application didn't want UTF-8 the situation is about as hopeless as
> with Python2.
>
>
> *** wsgiref.simple_server/(any)/Python2
> OK.
>
> Bytes all the way through.
>
>
> *** wsgiref.simple_server/(any)/Python3
> Probably will be OK, as long as the app wants UTF-8.
>
> simple_server is currently broken in rc2. However judging by the code, it is
> using urllib.parse.unquote, which assumes UTF-8, so it'll be fine for apps
> that want UTF-8 and hopeless for those that don't.
>
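For reference, a short sketch of how urllib.parse.unquote treats escapes under its default UTF-8 assumption. This reflects the documented behaviour of urllib.parse in current Python 3 releases, which is not necessarily identical to the rc2 code discussed above; the sample paths are made up.

```python
from urllib.parse import unquote, unquote_to_bytes

utf8_path = "/caf%C3%A9"    # "é" escaped as UTF-8
latin1_path = "/caf%E9"     # "é" escaped as ISO-8859-1

# Default decoding is UTF-8 with errors="replace".
ok = unquote(utf8_path)          # decodes cleanly
broken = unquote(latin1_path)    # invalid UTF-8 byte becomes U+FFFD

# An app that controls the call itself can choose the charset,
# or take the raw bytes and decode them later.
explicit = unquote(latin1_path, encoding="iso-8859-1")
via_bytes = unquote_to_bytes(latin1_path).decode("iso-8859-1")
```

The catch, as noted above, is that the application only gets this choice if it sees the still-escaped path; once the server has unquoted it, the decision has already been made.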
>
> I'd be very interested to hear what other servers are doing in this
> situation - nginx? cherrypy's one? - and wonder if any particular behaviour
> should be 'blessed'.
>
> --
> And Clover
> mailto:and at doxdesk.com
> http://www.doxdesk.com/