[Web-SIG] Unicode in Python 3

Sat Sep 19 18:00:04 CEST 2009

Hello again,

ok, getting back on topic... away from py3k porting methods...

Using an API where the user can request the type wanted solves a lot
of encoding issues.

This is similar to Graham's suggestion, but instead lets the user
request which encoding they want, and also get access to the raw data
if needed.

What is proposed:
    1. Default utf-8 to be used.
    2. A buffer to be used for raw data.
    3. New keys which are callables to request the encoding you want.
    4. Encoding keys are specified.
    4.a URI encoding key 'wsgi.uri_encoding'
    4.b Form data encoding key 'wsgi.form_encoding'
    4.c Page encoding key 'wsgi.page_encoding'
    4.d Header encoding key 'wsgi.header_encoding'
    5. For the next version of wsgi (1.1 or 2.0), use an adapter for
backwards compat for wsgi 1.0 apps on a wsgi 2.0 server.

Why this is good:
    1. utf-8 is most common for frameworks and web browsers.
    2.a Raw values to be accessed in the rare cases they are needed.
    2.b More performant wsgi servers (zero-copy and zero-allocation
become possible with buffers)
    2.c Avoids the bytes type and syntax, for compatibility with
python <= 2.5.4 (buffer and unicode are used instead)
    3. Transcoding to only happen if needed.
    4. URI encoding can be explicitly stated in a URI key
    5. Backwards compat for wsgi 1.0 apps on wsgi 2 server.  Also wsgi
2.0 apps on wsgi 1.0 server with an adapter.

How applications use this proposal:
   # here we get the default encoding and type - unicode utf-8, and
   # it's urldecoded.
   script_name_default_type = environ['SCRIPT_NAME']()

   # we can pass in the encoding we want.
   script_name = environ['SCRIPT_NAME'](application_uri_encoding)
   script_name_utf8 = environ['SCRIPT_NAME']('utf-8')
   script_name_iso_8859_1 = environ['SCRIPT_NAME']('iso-8859-1')

   # we can get it as a buffer with raw bytes.
   script_name_buffer = environ['SCRIPT_NAME'](as_buffer=True,
                                               no_urldecoding=True)
   # we can get it as whatever the raw native type is.
   script_name_native = environ['SCRIPT_NAME'](native_type=True,
                                               no_urldecoding=True)

For servers:
   Servers store only the native raw version in the environ (as a
buffer, or whatever their native type and encoding is), plus callables
that do any transcoding as needed.  If the application never calls
them, the server spends no resources transcoding or storing different
transcoded versions.
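
A rough sketch of what such a server-side value could look like, to make
the "store raw, transcode on request" idea concrete.  The class name
LazyEnvironValue is made up for illustration, and it assumes Python 3,
where bytes and memoryview stand in for the buffer type; the keyword
arguments are the ones used in the application example above.

```python
from urllib.parse import unquote_to_bytes


class LazyEnvironValue:
    """Hypothetical callable environ entry; not part of any wsgi spec."""

    def __init__(self, raw):
        # The raw bytes are the only thing the server stores.
        self._raw = bytes(raw)

    def __call__(self, encoding="utf-8", as_buffer=False,
                 native_type=False, no_urldecoding=False):
        # URL-decode only when asked to, and only at call time.
        data = self._raw if no_urldecoding else unquote_to_bytes(self._raw)
        if as_buffer:
            return memoryview(data)
        if native_type:
            # Assumption: this server's native type is bytes.
            return data
        # Transcoding happens here, and only if the app asks for text.
        return data.decode(encoding)


environ = {"SCRIPT_NAME": LazyEnvironValue(b"/caf%C3%A9")}
```

Nothing is decoded until the application calls the value, so an app
that never touches SCRIPT_NAME costs the server nothing beyond keeping
the raw bytes around.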

Adapters:
    To make backwards compatibility easier, wsgiref should have
adapters for old servers and clients.

  For wsgi 1.0 apps on wsgi 2.0 servers:
      An adapter would be written to return an environ whose keys are
suitable for wsgi 1.0.

  For wsgi 1.0 servers running wsgi 2.0 apps:
      An adapter should be available to let wsgi 2.0 apps run on wsgi
1.0 servers.
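
For the first direction (a wsgi 1.0 app on a server that hands out
callable values), the adapter could be as small as the sketch below.
wsgi1_adapter is a made-up name, nothing like it exists in wsgiref
today, and the choice of default encoding is an assumption that would
need to be agreed on.

```python
def wsgi1_adapter(app, encoding="iso-8859-1"):
    """Hypothetical wrapper: flatten callable environ values to strings."""

    def wrapped(environ, start_response):
        flat = {}
        for key, value in environ.items():
            # "wsgi.*" entries are server objects (some of them callable,
            # e.g. a file wrapper), so they are passed through untouched.
            if callable(value) and not key.startswith("wsgi."):
                flat[key] = value(encoding)
            else:
                flat[key] = value
        return app(flat, start_response)

    return wrapped
```

The old app then sees a plain dict of strings, as it always has, and
the transcoding cost is paid once per request by the adapter rather
than by the server.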

Issues with proposal?  Things this proposal did not consider?

    - Maybe we could be explicit about what the http server, http
client, wsgi client, and application think the encodings are.  This
might allow 'fail fast' behaviour and sanity checking, so things aren't
messed up silently.  If the webserver, web client and application
developer all specify what they are expecting, then checks could be
done; if one of them can't specify for some reason, then it's the
situation we are in now.  Haven't thought this through much.
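
To illustrate the 'fail fast' idea only: a check along these lines
could compare whatever encodings the parties did declare.  The function
name and the party names are invented here; codecs.lookup is the
standard library call used to normalise spellings such as "utf8" and
"UTF-8" to one name.

```python
import codecs


def check_encodings(**declared):
    """Raise if the parties that declared an encoding disagree.

    A party that passes None is treated as 'could not specify'.
    """
    known = {who: codecs.lookup(enc).name
             for who, enc in declared.items() if enc is not None}
    if len(set(known.values())) > 1:
        raise ValueError("encoding mismatch: %r" % (known,))
    # The agreed encoding, or None if nobody declared one.
    return next(iter(known.values()), None)
```

With check_encodings(server="utf8", client="UTF-8", app=None) the
declared values agree, so the call returns "utf-8"; a disagreement
raises immediately instead of producing mangled text later on.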