[Web-SIG] Unicode in Python 3

Sat Sep 19 14:54:10 CEST 2009

2009/9/19 Armin Ronacher <armin.ronacher at active-4.com>:
> Graham's suggestion for URL encodings means that the URL encoding would
> have to be passed to the WSGI server from outside (he proposed the
> apache config as an example).  This means that the application behavior
> will change based on the server configuration, causing even more confusion.

No, it doesn't, and you could still have things work without needing to
override the default encodings applied.

The default rule inside of the WSGI adapter would be:

  try:
    script_name = raw_script_name.decode('utf-8')
    path_info = raw_path_info.decode('utf-8')
    query_string = raw_query_string.decode('utf-8')
    uri_encoding = 'utf-8'
  except UnicodeDecodeError:
    # Not valid UTF-8. ISO-8859-1 maps every byte, so this cannot fail.
    script_name = raw_script_name.decode('iso-8859-1')
    path_info = raw_path_info.decode('iso-8859-1')
    query_string = raw_query_string.decode('iso-8859-1')
    uri_encoding = 'iso-8859-1'

  environ['SCRIPT_NAME'] = script_name
  environ['PATH_INFO'] = path_info
  environ['QUERY_STRING'] = query_string
  environ['wsgi.uri_encoding'] = uri_encoding
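
For anyone wanting to try the default rule outside of an adapter, here is
a minimal sketch in Python 3 terms. The function name decode_uri_parts is
made up for illustration and is not part of any WSGI proposal; it only
assumes the three raw values arrive as bytes.

```python
def decode_uri_parts(raw_script_name, raw_path_info, raw_query_string):
    """Decode the raw byte values, preferring UTF-8.

    If any one of them is not valid UTF-8, all three are decoded as
    ISO-8859-1 instead, so a single encoding applies to the request.
    """
    raw = (raw_script_name, raw_path_info, raw_query_string)
    try:
        decoded = [value.decode('utf-8') for value in raw]
        uri_encoding = 'utf-8'
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value, so this branch cannot fail.
        decoded = [value.decode('iso-8859-1') for value in raw]
        uri_encoding = 'iso-8859-1'
    return {
        'SCRIPT_NAME': decoded[0],
        'PATH_INFO': decoded[1],
        'QUERY_STRING': decoded[2],
        'wsgi.uri_encoding': uri_encoding,
    }
```

Calling it with b'/caf\xc3\xa9' as the path yields 'utf-8' as the
encoding, while b'/caf\xe9' falls back to 'iso-8859-1'.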

At the WSGI application level, if the application provides for use of an
alternate URI encoding, all it would need to do (ignoring encoding
name equivalence issues for now) is:

  if application_uri_encoding != environ['wsgi.uri_encoding']:
    raw_script_name = environ['SCRIPT_NAME'].encode(
        environ['wsgi.uri_encoding'])
    raw_path_info = environ['PATH_INFO'].encode(environ['wsgi.uri_encoding'])
    raw_query_string = environ['QUERY_STRING'].encode(
        environ['wsgi.uri_encoding'])

    script_name = raw_script_name.decode(application_uri_encoding)
    path_info = raw_path_info.decode(application_uri_encoding)
    query_string = raw_query_string.decode(application_uri_encoding)

  else:
    script_name = environ['SCRIPT_NAME']
    path_info = environ['PATH_INFO']
    query_string = environ['QUERY_STRING']
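
The same application-side step can be wrapped up as a helper. This is
only a sketch: get_uri_parts and transcode are hypothetical names, and
like the snippet above it compares encoding names literally rather than
dealing with aliases.

```python
def transcode(value, from_encoding, to_encoding):
    """Re-encode to the original bytes, then decode as the wanted encoding."""
    return value.encode(from_encoding).decode(to_encoding)


def get_uri_parts(environ, application_uri_encoding):
    """Return (script_name, path_info, query_string) in the app's encoding."""
    server_encoding = environ['wsgi.uri_encoding']
    keys = ('SCRIPT_NAME', 'PATH_INFO', 'QUERY_STRING')
    if application_uri_encoding == server_encoding:
        # Nothing to do; the adapter already decoded as the app wanted.
        return tuple(environ[key] for key in keys)
    return tuple(
        transcode(environ[key], server_encoding, application_uri_encoding)
        for key in keys
    )
```

An application expecting iso-8859-4 would call
get_uri_parts(environ, 'iso-8859-4') and get back strings decoded from
the original bytes, whichever of the two defaults the adapter applied.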

So there is no strict need to make the WSGI adapter do it differently. You
would only want that if you were concerned about the overhead of transcoding.

Transcoding just these is most probably going to be less overhead than
the WSGI adapter having to set up both unicode and raw values in a
dictionary for everything.

Even with your iso-8859-4 example, I can't see how you could lose track
of what the original characters are, as wsgi.uri_encoding always being
provided lets you transcode to what you needed whenever what was
supplied didn't match.
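
A quick way to convince yourself nothing is lost: encoding the decoded
value with whatever wsgi.uri_encoding says gives back the bytes the
server received. A small sketch in Python 3, where recover_raw is a
made-up helper and the sample bytes are just an illustration:

```python
def recover_raw(value, uri_encoding):
    """Return the bytes the server originally received for a decoded value."""
    return value.encode(uri_encoding)


# Every possible byte value survives the ISO-8859-1 fallback intact.
all_bytes = bytes(range(256))
assert recover_raw(all_bytes.decode('iso-8859-1'), 'iso-8859-1') == all_bytes

# Hypothetical path sent as iso-8859-4. It is not valid UTF-8, so the
# adapter's default rule would have decoded it as iso-8859-1.
raw = b'/\xb1\xe0'
as_served = raw.decode('iso-8859-1')
intended = recover_raw(as_served, 'iso-8859-1').decode('iso-8859-4')
```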

As to the separate argument about repeated slashes and percent
encoding of slashes losing the distinction between them, the definition
using wsgi.uri_encoding also provided REQUEST_URI as bytes anyway, so you
can get that directly from there, as you wanted in your bytes-everywhere
solution anyway.
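
To illustrate the point, a rough sketch of pulling path segments out of
a bytes REQUEST_URI while keeping a percent-encoded slash distinct from a
real one. The helper name split_raw_path is hypothetical, and it assumes
REQUEST_URI holds the raw request target as bytes:

```python
from urllib.parse import unquote_to_bytes


def split_raw_path(request_uri):
    """Split the path on literal slashes first, then percent-decode."""
    path = request_uri.split(b'?', 1)[0]
    # Splitting before unquoting keeps %2F inside its segment, and
    # repeated slashes show up as empty segments rather than vanishing.
    return [unquote_to_bytes(segment) for segment in path.split(b'/')]
```

For b'/a%2Fb//c?x=1' this gives [b'', b'a/b', b'', b'c'], so both the
encoded slash and the doubled slash remain visible to the application.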

Now you can go back to monologue, as definitely sleeping now. ;-)

Graham