[Web-SIG] Unicode in Python 3

René Dudfield renesd at gmail.com
Sat Sep 19 15:51:06 CEST 2009


On Sat, Sep 19, 2009 at 1:54 PM, Graham Dumpleton
<graham.dumpleton at gmail.com> wrote:
> 2009/9/19 Armin Ronacher <armin.ronacher at active-4.com>:
>> Graham's suggestion for URL encodings means that the URL encoding would
>> have to be passed to the WSGI server from outside (he proposed the
>> apache config as an example).  This means that the application behavior
>> will change based on the server configuration, causing even more confusion.
>
> No it doesn't and you could still have things work without needing to
> override the default encodings applied.
>
> The default rule inside of the WSGI adapter would be:
>
>  try:
>    script_name = raw_script_name.decode('utf-8')
>    path_info = raw_path_info.decode('utf-8')
>    query_string = raw_query_string.decode('utf-8')
>    uri_encoding = 'utf-8'
>  except UnicodeDecodeError:
>    script_name = raw_script_name.decode('iso-8859-1')
>    path_info = raw_path_info.decode('iso-8859-1')
>    query_string = raw_query_string.decode('iso-8859-1')
>    uri_encoding = 'iso-8859-1'
>  finally:
>    environ['SCRIPT_NAME'] = script_name
>    environ['PATH_INFO'] = path_info
>    environ['QUERY_STRING'] = query_string
>    environ['wsgi.uri_encoding'] = uri_encoding
>
> At the WSGI application level, if it provides for use of an alternate
> URI encoding, I saw that all it would need to do (ignoring encoding
> name equivalence issues for now) is:
>
>  if application_uri_encoding != environ['wsgi.uri_encoding']:
>    raw_script_name = environ['SCRIPT_NAME'].encode(environ['wsgi.uri_encoding'])
>    raw_path_info = environ['PATH_INFO'].encode(environ['wsgi.uri_encoding'])
>    raw_query_string = environ['QUERY_STRING'].encode(environ['wsgi.uri_encoding'])
>
>    script_name = raw_script_name.decode(application_uri_encoding)
>    path_info = raw_path_info.decode(application_uri_encoding)
>    query_string = raw_query_string.decode(application_uri_encoding)
>
>  else:
>    script_name = environ['SCRIPT_NAME']
>    path_info = environ['PATH_INFO']
>    query_string = environ['QUERY_STRING']
>
> So, no strict need to make the WSGI adapter do it differently. You may
> want to only do that if concerned about overhead of transcoding.
>
> Transcoding just these is most probably going to be less overhead than
> the WSGI adapter having to set up both unicode and raw values in a
> dictionary for everything.
>

Can these be lazily transcoded?

I think they can if they are turned into callables, since the environ
has to be of a dict type and not some other type (unless that design
should also be changed).

So the current ones stay as is, to reflect current usage, and new
ones use callables.  The callables return the type you ask for.  This
way it's possible to skip encoding/decoding entirely and do it only
when needed.


For applications using the new way:
    # we can pass in the encoding we want.
    script_name = environ['SCRIPT_NAME_'](application_uri_encoding)
    script_name_utf8 = environ['SCRIPT_NAME_']('utf-8')
    script_name_iso_8859_1 = environ['SCRIPT_NAME_']('iso-8859-1')

    # we can get it as a buffer.
    script_name_buffer = environ['SCRIPT_NAME_'](as_buffer = True)
    # we can get it as whatever the raw native type is.
    script_name_native = environ['SCRIPT_NAME_'](native_type = True)

    # here we get the default encoding and type, which could be
    # unicode or bytes.
    script_name_default_type = environ['SCRIPT_NAME_']()

For servers:
    Servers store just the native raw version in the environ (as a
buffer, or whatever their native type and encoding is), plus callables
to do any transcoding as needed.  If the application does not use them,
then the server doesn't spend any resources transcoding.
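
A rough sketch of what such a server-side callable might look like, in
Python 3 terms.  The class name LazyEnvironValue, the default encoding
and the per-encoding cache are assumptions made for illustration only;
the keyword names as_buffer and native_type follow the usage above.
Nothing here is part of WSGI or any existing server.

```python
class LazyEnvironValue:
    """Hypothetical wrapper: keeps the raw bytes and decodes only on request."""

    def __init__(self, raw, default_encoding='utf-8'):
        self._raw = raw                        # native raw bytes from the server
        self._default_encoding = default_encoding
        self._cache = {}                       # encoding name -> decoded str

    def __call__(self, encoding=None, as_buffer=False, native_type=False):
        if as_buffer:
            # zero-copy view of the raw bytes
            return memoryview(self._raw)
        if native_type:
            # whatever the server stored, untouched
            return self._raw
        encoding = encoding or self._default_encoding
        if encoding not in self._cache:
            # decoding happens here, and only once per encoding
            self._cache[encoding] = self._raw.decode(encoding)
        return self._cache[encoding]


# The server stores only the raw value; nothing is decoded yet.
environ = {'SCRIPT_NAME_': LazyEnvironValue(b'/caf\xc3\xa9')}

script_name_utf8 = environ['SCRIPT_NAME_']('utf-8')
script_name_iso_8859_1 = environ['SCRIPT_NAME_']('iso-8859-1')
script_name_native = environ['SCRIPT_NAME_'](native_type=True)
```

An application that never calls environ['SCRIPT_NAME_'] costs the
server nothing beyond storing the raw bytes, and repeated calls with
the same encoding reuse the cached result.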

