[Web-SIG] Unicode in Python 3

Sat Sep 19 20:14:23 CEST 2009

On Sat, Sep 19, 2009 at 6:00 PM, Armin Ronacher
<armin.ronacher at active-4.com> wrote:
> Hi,
>
> René Dudfield schrieb:
>> What is proposed:
> Where was that proposed?
>
>>     1. Default utf-8 to be used.
> That's a possibility yes, but it has to be carefully considered.
>
>>     2. A buffer to be used for raw data.
> What is raw data?  If you mean we keep the unencoded data around, I
> would strongly argue against that.  Otherwise it makes middlewares even
> harder to write.
>

Raw data in this case is whatever data the server provides.  The
idea is to convert it on demand.

>>     3. New keys which are callables to request the encoding you want.
> Did I miss something?  Why are we requesting encodings now?
>

You can request encodings.  The idea is to be explicit about which
encoding you want.

This also allows no conversion to take place if it isn't needed.
Converting strings is a waste of time if it's not needed.
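
A minimal sketch of the idea, written for Python 3.  The LazyValue
class and its decode_calls counter are made up for illustration;
nothing in WSGI defines them.  The value keeps the raw bytes and only
decodes when the application calls it with the encoding it wants.

```python
class LazyValue:
    """Hypothetical environ value: holds raw bytes, decodes only when called."""

    def __init__(self, raw):
        self.raw = raw          # bytes exactly as received from the server
        self.decode_calls = 0   # counts conversions, for illustration only

    def __call__(self, encoding="utf-8"):
        self.decode_calls += 1
        return self.raw.decode(encoding)


environ = {"PATH_INFO": LazyValue("/caf\u00e9".encode("utf-8"))}

# Nothing has been decoded yet; an app that never asks pays no decoding cost.
untouched = environ["PATH_INFO"].decode_calls

# The application states explicitly which encoding it wants.
path = environ["PATH_INFO"]("utf-8")
```

An application that never touches PATH_INFO never triggers a decode.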

>>> b = "a" * 4096
>>> %timeit b.decode('utf-8')
100000 loops, best of 3: 15 µs per loop

Even length 1 strings take a while too.
>>> b = "a"
>>> %timeit b.decode('utf-8')
1000000 loops, best of 3: 1.84 µs per loop

Note that decoding needs a method call anyway.

In comparison, a method call takes a tiny amount of time.
>>> a = {'SCRIPT_INFO':"asdfasdf", 'SCRIPT_INFO2': lambda : 'asdfasdf2'}
>>> %timeit a['SCRIPT_INFO2']()
1000000 loops, best of 3: 267 ns per loop
>>> %timeit a['SCRIPT_INFO']
10000000 loops, best of 3: 122 ns per loop

This is why avoiding encode/decode work is better.
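
The timings above come from an IPython session on Python 2.  To repeat
the comparison with only the standard library on Python 3, something
like the following should work.  It is a rough sketch: the lambda
wrappers add their own call overhead to every measurement, and the
numbers will differ between machines and Python versions.

```python
import timeit

raw = b"a" * 4096
environ = {"SCRIPT_INFO": "asdfasdf", "SCRIPT_INFO2": lambda: "asdfasdf2"}

n = 100000  # loops per measurement

# Total seconds for n runs of each operation.
decode_s = timeit.timeit(lambda: raw.decode("utf-8"), number=n)
call_s = timeit.timeit(lambda: environ["SCRIPT_INFO2"](), number=n)
lookup_s = timeit.timeit(lambda: environ["SCRIPT_INFO"], number=n)

for label, total in [("decode", decode_s),
                     ("callable", call_s),
                     ("lookup", lookup_s)]:
    print("%-8s %.0f ns per loop" % (label, total / n * 1e9))
```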

If environ were allowed to be a real object rather than a dict, it
would be possible to avoid the dict key lookup and the method call.

>>     4. Encoding keys are specified.
>>     4.a URI encoding key 'wsgi.uri_encoding'
>>     4.b Form data encoding key 'wsgi.form_encoding'
>>     4.c Page encoding key 'wsgi.page_encoding'
>>     4.d Header encoding key 'wsgi.header_encoding'
> I don't know where you are getting that from.  The only WSGI key would
> be `wsgi.uri_encoding` and that is only set by the server and only used
> for legacy non UTF-8 URLs.
>

I got that from your list of things with different encodings.  Why not
use it for the other parts as well?  Some header keys use different
encodings, as does form data, and page encodings.

>>     5. For next version of wsgi (1.1 or 2.0), using an adapter for
>> backwards compat for wsgi 1.0 apps on wsgi2 server.
> No decision about WSGI versioning was made so far.  If WSGI in Python 3
> is based on unicode, then the version is raised to 1.1,  2.0 is not yet
> discussed as far as I'm concerned.
>

Sure, it's a separate issue.  However, I'm addressing it here.  WSGI
2.0 has been discussed in various emails recently, and in Graham's blog
post.  There is also a WSGI 2.0 wiki page on wsgi.org.

>>     2.c Avoiding bytes type and syntax for compatibility with <=
>> python 2.5.4 (buffer, and unicode)
> If WSGI for Python 3 is based on Unicode it will use '' for textual
> context and b'' for bytes.  If it's based on bytes it will obviously use
> the byte literals.

Again, using bytes doesn't seem as nice as using buffers along with
unicode.  Buffers can be faster (they are not immutable, so you can
avoid memory allocation and make use of zero-copy networking), and
buffers are available in more versions of Python.
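
To illustrate the allocation point: in Python 3 the old buffer type is
gone and memoryview fills that role, so this sketch uses a preallocated
bytearray with socket.recv_into.  The read_into helper is a made-up
name, and socketpair stands in for a real client connection.

```python
import socket


def read_into(sock, buf):
    """Fill a preallocated buffer from the socket; return a view of the bytes read."""
    n = sock.recv_into(buf)        # writes into buf, no new bytes object
    return memoryview(buf)[:n]     # slicing a memoryview does not copy


client, server = socket.socketpair()
client.sendall(b"GET /bla/ HTTP/1.1\r\n")

buf = bytearray(4096)              # allocated once, reusable across reads
view = read_into(server, buf)

# A copy is made only here, when an actual bytes object is needed.
request_line = bytes(view).split(b"\r\n")[0]

client.close()
server.close()
```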

>
>>     3. Transcoding to only happen if needed.
> I can't see how that would work if it's based on unicode, if it's based
> on bytes that's already what happens in WSGI 1.
>

Since you can request different encodings, a value that is already
available in the requested encoding can be handed over as is; if it is
not available, the conversion is made at that point.  If you never
need the conversion, it is avoided completely.
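
Roughly, that could look like the sketch below.  The Representations
class is hypothetical, and it compares encoding names literally (it
would treat "latin-1" and "latin1" as different), which a real
implementation would want to normalise.

```python
class Representations:
    """Hypothetical holder for one value in several encodings."""

    def __init__(self, raw, encoding):
        self.source_encoding = encoding
        self.available = {encoding: raw}   # encoding name -> bytes
        self.transcodes = 0                # counts conversions, for illustration

    def get(self, encoding):
        if encoding not in self.available:
            # Not available yet: convert once and keep the result.
            source = self.available[self.source_encoding]
            text = source.decode(self.source_encoding)
            self.available[encoding] = text.encode(encoding)
            self.transcodes += 1
        return self.available[encoding]


value = Representations("caf\u00e9".encode("latin-1"), "latin-1")

same = value.get("latin-1")          # already available, no conversion
after_same = value.transcodes

utf8 = value.get("utf-8")            # converted on first request
utf8_again = value.get("utf-8")      # served from the stored copy
```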

>>     4. URI encoding can be explicitly stated in a URI key
> This value is only *set* by the server on decode, the value is to be
> ignored by the actual application or middleware except for QUERY_STRING
> and REQUEST_URI decoding.  Everything else makes things a lot more
> complicated without improving anything.
>

Yeah, the server states what is happening.

As the application requests what it wants, it doesn't need to query those keys.

>>     5. Backwards compat for wsgi 1.0 apps on wsgi 2 server.  Also wsgi
>> 2.0 apps on wsgi 1.0 server with an adapter.
> Again, WSGI 2.0 is something that has to be discussed separately,
> otherwise we totally lose track.
>
>> Issues with proposal?  Things this proposal did not consider?
> Yes you did:
>
> -  it has no real world advantage over either WSGI based on unicode
>   that is utf-8 with latin1 fallback or a WSGI based on bytes.

I listed all the advantages in the 'This allows or this is good
because:' section.  Can you explain why they are not real?

> -  it's backwards incompatible in every way, even to CGI.

Why is that?  WSGI apps can use an adapter to run on it.  WSGI 1.0
servers can also use an adapter.
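
A rough sketch of the app-side adapter, assuming the new-style environ
holds callables that take an encoding name.  That signature, and the
names wsgi1_adapter and legacy_app, are assumptions for illustration.
Treating every callable value as a lazy string is also a
simplification.

```python
def wsgi1_adapter(app, encoding="utf-8"):
    """Hypothetical adapter: resolve callable environ values so that a
    WSGI 1.0 style app sees plain strings."""

    def wrapped(lazy_environ, start_response):
        plain = {}
        for key, value in lazy_environ.items():
            plain[key] = value(encoding) if callable(value) else value
        return app(plain, start_response)

    return wrapped


def legacy_app(environ, start_response):
    # An ordinary app that expects plain string values.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [("Hello from " + environ["SCRIPT_NAME"]).encode("utf-8")]


lazy_environ = {
    "SCRIPT_NAME": lambda encoding: b"/bla".decode(encoding),
    "REQUEST_METHOD": "GET",
}

statuses = []
body = wsgi1_adapter(legacy_app)(
    lazy_environ, lambda status, headers: statuses.append(status))
```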

> -  it is slow because every dict access would also cause a function
>   call.

As explained above, the transcoding cost can be avoided or reduced,
function calls need to be made anyway (the decode() calls), and
there's also the possibility of using buffers to avoid memory
allocation and allow zero copy networking.

> Furthermore middlewares would most likely start causing
>   circular dependencies when they replace the callable with a new
>   callable and they do not alias the value as a local in the frame
>   that created it.
>

Yes, I think the callables will need a set method... rather than
letting the middleware replace callables.

I think this could be used for middleware:
  environ['SCRIPT_NAME'](set="/bla/", urldecoding=False, encoding='utf-8')

but then this (one callable) would probably be better ;)
  environ(what='SCRIPT_NAME', set="/bla/", urldecoding=False, encoding='utf-8')
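
One way the single-callable form above might behave, as a sketch only.
The Environ class is hypothetical, and the semantics are assumptions:
set stores the new text encoded with the given encoding, and
urldecoding=True percent-decodes the stored bytes before decoding
them.

```python
from urllib.parse import unquote_to_bytes

_UNSET = object()  # sentinel, so that set=None stays distinguishable


class Environ:
    """Hypothetical single-callable environ."""

    def __init__(self, raw):
        self._raw = dict(raw)   # key -> bytes as received

    def __call__(self, what, set=_UNSET, urldecoding=False, encoding="utf-8"):
        if set is not _UNSET:
            # Middleware goes through the callable instead of replacing it.
            self._raw[what] = set.encode(encoding)
        raw = self._raw[what]
        if urldecoding:
            raw = unquote_to_bytes(raw)
        return raw.decode(encoding)


environ = Environ({"SCRIPT_NAME": b"/caf%C3%A9"})

original = environ("SCRIPT_NAME", urldecoding=True)
changed = environ(what="SCRIPT_NAME", set="/bla/", urldecoding=False,
                  encoding="utf-8")
```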

Middleware changing a value could potentially trigger the rest of the
decoding.  In some situations you would want to avoid reading from
the socket at all, but middleware changing things means you have to
read from the socket (obviously you need to read something before
changing it).

Why would you not want to read from the socket at all?  (WSGI 1.0
makes these impossible.)
    - to block certain hosts by looking at their IP.
    - you might just care about a connection, e.g. any connection
      triggers an action.
    - for load balancing.
    - to look at the port number, e.g. to check if port 443 is used.
    - if you are overloaded (DoS), you want to drop the connection
      right away.
    - ... others.

So allowing the server to avoid most processing before the application
requests certain data could be a good thing.
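
A small sketch of the first case, blocking by IP without reading
anything.  LazyRequest, handle and BLOCKED_IPS are made-up names, and
because socketpair has no real peer address the address is passed in
by hand here.

```python
import socket

BLOCKED_IPS = {"203.0.113.7"}  # hypothetical blocklist (documentation range)


class LazyRequest:
    """Hypothetical request that touches the socket only when asked to."""

    def __init__(self, conn, peer):
        self.conn = conn
        self.peer = peer        # (ip, port), known without reading any data
        self.bytes_read = 0

    def read(self, size=4096):
        data = self.conn.recv(size)
        self.bytes_read += len(data)
        return data


def handle(request):
    if request.peer[0] in BLOCKED_IPS:
        request.conn.close()    # drop right away, nothing read or parsed
        return None
    return request.read()


def demo(peer):
    client, server = socket.socketpair()
    client.sendall(b"GET / HTTP/1.1\r\n\r\n")
    request = LazyRequest(server, peer)
    result = handle(request)
    client.close()
    server.close()
    return result, request.bytes_read


blocked = demo(("203.0.113.7", 51234))
allowed = demo(("198.51.100.2", 443))
```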

So with middleware changing the environ, all those callables need to
be linked, so that the rest of them know when something has been
changed.  When one thing is changed, it drops back to WSGI 1.0
behaviour; that is, some of the encoding is done just before any
change is allowed.

Or maybe middleware has to call an environ['changing']() callable,
which could then trigger the callables' internal transcoding and
socket reading etc.
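
As a sketch of that second option: make_environ is a hypothetical
helper, and the only key taken from the text is 'changing'.  Calling
it resolves every lazy value to a plain string, after which middleware
can modify the dict as it would under WSGI 1.0.

```python
def make_environ(raw, encoding="utf-8"):
    """Build a hypothetical environ whose values stay lazy until
    environ['changing']() is called."""
    environ = {}

    def lazy(key):
        return lambda: raw[key].decode(encoding)

    for key in raw:
        environ[key] = lazy(key)

    def changing():
        # Resolve every remaining callable to a plain string.
        for key, value in list(environ.items()):
            if key != "changing" and callable(value):
                environ[key] = value()

    environ["changing"] = changing
    return environ


environ = make_environ({"SCRIPT_NAME": b"/app", "PATH_INFO": b"/x"})
was_lazy = callable(environ["PATH_INFO"])

# Middleware announces the change first, then edits plain values.
environ["changing"]()
environ["SCRIPT_NAME"] = "/bla/"
```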

I'm still not sure whether it will make middleware harder to use.
I'm working through the function Philip sent to see how it turns out,
and will send an updated proposal after that.