[Web-SIG] Unicode in Python 3

René Dudfield renesd at gmail.com
Sat Sep 19 14:26:13 CEST 2009


On Sat, Sep 19, 2009 at 11:59 AM, Armin Ronacher
<armin.ronacher at active-4.com> wrote:
> Hi,
>
> Armin Ronacher schrieb:
>> urllib.parse appears to be buggy with bytestrings:
>>
>> I'm pretty sure the latter is a bug and I will file one, however if
>> there is broken behavior with bytestrings in Python 3.1 that's another
>> thing we have to keep in mind.
> I have to correct myself, there are separate functions for byte quoting.
> (parse.unquote_to_bytes, parse.quote_from_bytes).
>
>

Hi,

I think that shows that they are being handled differently depending
on type, which goes against polymorphism... but some people prefer to
have separate functions for different types (in and out).  I don't
think other Python functions do this though.  So maybe this is a
one-off, and could be considered a bug... I'm not sure why they did it
this way.
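
For illustration, this is roughly how the str and bytes variants pair
up in Python 3's urllib.parse.  It is only a sketch of the standard
library calls named above; the sample value "café" is made up for the
example.

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes

# Sample data (an assumption for this sketch): UTF-8 bytes from a client.
raw = "café".encode("utf-8")

# bytes in, str out
quoted = quote_from_bytes(raw)          # 'caf%C3%A9'

# str in, bytes out
roundtrip = unquote_to_bytes(quoted)    # the same bytes as `raw`

# str in, str out (both default to UTF-8)
text_quoted = quote("café")             # 'caf%C3%A9'
text = unquote(text_quoted)             # 'café'
```

So the input and output types are chosen by function name rather than
by the type of the argument.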

Here is a snippet from the compat.py we used to port pygame to support
Python 2.3 through 3.1:

try:
    unicode_ = unicode
except NameError:
    unicode_ = str



You can see that this then allows you to do this:

>>> print( unicode_(b'sdf %s %s') % ('sdf', 'ef'))
b'sdf sdf ef'

>>> ord(unicode_('ÿ'))
255


This allows your code to have (somewhat) the same behavior for unicode
on both 2.x and 3.x.  Using b'' in your code makes it impossible to
share the same code base between 3.x and the 2.x versions before 2.6,
where b'' is a syntax error.
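
The same trick can be extended to byte strings without writing b''
anywhere.  The sketch below is only an assumption about how that might
look; the as_bytes helper is a hypothetical name for this example and
is not quoted from pygame's compat.py.

```python
try:
    unicode_ = unicode      # Python 2.x
except NameError:
    unicode_ = str          # Python 3.x


def as_bytes(text):
    """Hypothetical helper: turn a plain '' literal into a byte string.

    On 2.x a '' literal is already a byte string and is returned as-is.
    On 3.x it is text, so it is encoded with latin-1, which maps code
    points 0-255 one-to-one onto byte values.
    """
    if isinstance(text, unicode_):
        return text.encode("latin-1")
    return text


request_line = as_bytes('GET / HTTP/1.0\r\n')
single_byte = as_bytes('\xff')   # one byte with value 255 on 2.x and 3.x
```

The source file then contains only '' literals, so it still parses on
interpreters that do not know the b'' prefix.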


In summary of the arguments (please add if I've missed something):



Arguments against using bytes (and using unicode instead)
=========================================================

So I'm -1 on using b'' all over the place since it's not in both
versions of python, and makes it impossible for code bases to share
the same code for multiple versions of python.

Armin's code example shows how ugly it is to convert code with b'' all
over the place, and how it doesn't support many operations that
strings do in python2.x. -1 for that reason.  Also I think the sneaky
version shows the same thing with regard to b''.

Since 2to3 also uses unicode instead of bytes I'm -1 on using b''.

The python standard library also uses unicode in its API, as Armin has
shown, and not bytes.  So another reason for -1 on b''.


Argument for using bytes:
=========================
    socket methods return bytes in py3k...

Well, they do with recvfrom etc... but not with recvfrom_into.
recvfrom_into and friends put the bytes into a given buffer.  (As an
aside, we should be designing for these functions, as they allow a
zero-copy, zero-memory-allocation method of web server creation in
python.)
  'socket.recvfrom_into(buffer[, nbytes[, flags]])' is new in python2.5.
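
A minimal sketch of the buffer-filling style, assuming a platform where
socket.socketpair() is available.  The receive_into helper and the 4096
byte buffer size are made up for the example.

```python
import socket


def receive_into(sock, buf):
    """Fill the caller's preallocated buffer; return a view of the data.

    No new bytes object is created for the payload: recvfrom_into writes
    straight into `buf`, and the memoryview slice does not copy.
    """
    nbytes, _address = sock.recvfrom_into(buf)
    return memoryview(buf)[:nbytes]


left, right = socket.socketpair()
try:
    left.sendall(b"GET / HTTP/1.0\r\n\r\n")
    buf = bytearray(4096)            # allocated once, reusable per request
    view = receive_into(right, buf)
    received = bytes(view)           # copied here only to inspect the result
finally:
    left.close()
    right.close()
```

A server written this way could keep one buffer per connection and
parse the request from the view, instead of allocating a new bytes
object on every read.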



A workaround - and suggested solution
=====================================

Use unicode by default, but make another key available with raw data.

So to work around the problem of (rarely/occasionally) needing the raw
bytes, why don't we just have raw buffer keys in the environ?  This
covers the rare situations where the raw data is needed, and still
makes the common situation (using correctly decoded unicode strings)
possible.

I would suggest not using bytes for the raw values, but instead a raw
`buffer` object.  This makes it possible to use the zero-copy,
zero-memory-allocation methods.  array.array is suitable here, more
suitable than python3 bytes, since it is supported in older versions
of python as well.  Other forms of buffer should be usable too...
e.g. an mmap, a special apache buffer type, a numpy array, a pygame
surface buffer, a PIL buffer etc.
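
As a rough sketch of the idea (not a proposal for actual key names):
the "x.raw." prefix, the add_value helper and the choice of UTF-8 with
"replace" error handling are all assumptions made for this example.

```python
import array

RAW_PREFIX = "x.raw."   # hypothetical prefix for the raw buffer keys


def add_value(environ, key, raw_bytes, encoding="utf-8"):
    """Store a decoded unicode value plus the untouched raw buffer."""
    raw = array.array("B", raw_bytes)          # buffer object, not bytes
    environ[key] = bytes(raw).decode(encoding, "replace")
    environ[RAW_PREFIX + key] = raw
    return environ


environ = add_value({}, "PATH_INFO", b"/caf\xc3\xa9")

path = environ["PATH_INFO"]                    # decoded text for normal use
raw_path = environ[RAW_PREFIX + "PATH_INFO"]   # raw data for the rare cases
```

Most applications would only ever read environ["PATH_INFO"]; code that
needs the original bytes can wrap the raw value in a memoryview or pass
it to anything that accepts a buffer.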

This solution optimises for:
  - compatibility with older pythons when using the same code base.
  - compatibility with older wsgi applications.
  - and also with the 2to3 tool trans
  - ease of use in the most common cases.
  - similarity to other python API web stuff using unicode in python 3.1.
  - similarity to higher level frameworks like django, WebOb etc that
expose unicode.
  - possibility to access raw data when needed (in rare situations)
  - possibility to write more performant code if required (with
functions added since python2.3 and wsgi 1.0 were introduced).

