[Web-SIG] WSGI, Python 3 and Unicode

Fri Dec 7 10:39:09 CET 2007

[Phillip]
>> WSGI already copes, actually.  Note that Jython and IronPython have
>> this issue today, and see:
>>
>> http://www.python.org/dev/peps/pep-0333/#unicode-issues

[James]
> It would seem very odd, however, for WSGI/python3 to use strings-
> restricted-to-0xFF for network I/O while everywhere else in python3 is
> going to use bytes for the same purpose.

I think it's worth pointing out that the current restriction to
iso-8859-1 exists *because* Python did not have a bytes type at the
time the WSGI spec was drawn up. IIRC, the bytes type had not even
been proposed for Py3K yet. CPython effectively held all byte sequences
as strings, a paradigm which is (still) followed by Jython (not sure
about IronPython).

The restriction to iso-8859-1 is really a distraction; iso-8859-1 is
used simply as an identity encoding that also enforces that all
"bytes" in the string have a value from 0x00 to 0xff, so that they are
suitable for byte-oriented IO. So, in output terms at least, WSGI *is*
a byte-oriented protocol. The problem is that python-the-language
didn't have support for bytes at the time WSGI was designed.
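
To illustrate the identity-encoding point, here is a minimal sketch
written for Python 3, which postdates the interpreters discussed in
this thread; the variable names are illustrative only. It shows that
iso-8859-1 maps each byte value N to code point N and back, which is
why a string limited to 0x00-0xff can stand in for a byte sequence.

```python
# Every possible byte value, 0x00 through 0xff.
all_bytes = bytes(range(256))

# Decoding as iso-8859-1 maps byte N to the character with code point N.
as_text = all_bytes.decode("iso-8859-1")
code_points = [ord(ch) for ch in as_text]

# Encoding that text again recovers the original bytes unchanged.
round_tripped = as_text.encode("iso-8859-1")

print(code_points == list(range(256)))  # checks the identity mapping
print(round_tripped == all_bytes)       # checks the lossless round trip
```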

[James]
> You'd have to modify your app
> to call write(unicodetext.encode('utf-8').decode('latin-1')) or so....

Did you mean: write(unicodetext.encode('utf-8').encode('latin-1'))?

Either way, the second encode is not required;
write(unicodetext.encode('utf-8')) is sufficient, since it will
generate a byte sequence (string) which will (actually "should": see
the [*] note below) pass the following test:

try:
   wsgi_response_data.encode('iso-8859-1')
except UnicodeError:
   # Illegal WSGI response data!
   raise
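
The same test can be sketched for Python 3, where encode() returns a
bytes object rather than a string. This is an illustration under
stated assumptions, not part of the WSGI spec: the helper name
check_wsgi_string is hypothetical, and the iso-8859-1 decode step is
there only because a Python 3 str is needed to model a "string
restricted to 0xFF".

```python
def check_wsgi_string(data):
    """Raise ValueError unless every character in data fits in one byte.

    Hypothetical helper: models the iso-8859-1 test from the message above.
    """
    try:
        return data.encode("iso-8859-1")
    except UnicodeError:
        raise ValueError("Illegal WSGI response data!")


text = "IFN-\u03b3"

# UTF-8 encode to bytes, then view those bytes as a 0x00-0xff string.
byte_like = text.encode("utf-8").decode("iso-8859-1")
wire_bytes = check_wsgi_string(byte_like)  # passes: all code points <= 0xff

# The raw text contains U+03B3, which has no iso-8859-1 representation.
try:
    check_wsgi_string(text)
    rejected = False
except ValueError:
    rejected = True
```

In this sketch the bytes that come out of the check are exactly the
UTF-8 encoding of the original text, so nothing is re-encoded on the
way to the wire.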

On a side note, Philip Jenvey's excellent rework of the Jython IO
subsystem to use java.nio is fundamentally byte oriented:

http://www.nabble.com/fileno-support-is-not-in-jython.-Reason--t4750734.html
http://fisheye3.cenqua.com/browse/jython/trunk/jython/src/org/python/core/io

It is byte oriented because it is based on the new IO design for
Python 3K, as described in PEP 3116:

http://www.python.org/dev/peps/pep-3116/

Regards,

Alan.

[*] Although I notice that CPython 2.5, for a reason I don't fully
understand, fails this particular encoding sequence. (Maybe it's to do
with the possibility that the result of an encode operation is no
longer an encodable string?)

Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> response = u"interferon-gamma (IFN-\u03b3) responses in cattle"
>>> response.encode('utf-8').encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 22: ordinal not in range(128)
>>>

This means that to enforce the WSGI iso-8859-1 convention on CPython
2.5, you would have to carry out this rigmarole:

>>> response.encode('utf-8').decode('latin-1').encode('latin-1')
'interferon-gamma (IFN-\xce\xb3) responses in cattle'
>>>
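
One plausible reading of the failure, stated here as an assumption
rather than something established in this thread: in CPython 2,
calling encode() on a byte string first decodes it to unicode with the
default 'ascii' codec, and that implicit decode is what raises the
UnicodeDecodeError. The sketch below spells out both sequences
explicitly in Python 3, where no implicit decode exists; the variable
names are illustrative.

```python
response = "interferon-gamma (IFN-\u03b3) responses in cattle"
utf8_bytes = response.encode("utf-8")

# Step 1: spell out the implicit ASCII decode assumed to happen in CPython 2.
try:
    utf8_bytes.decode("ascii").encode("latin-1")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc  # raised by the decode, before latin-1 is ever applied

# Step 2: decoding as latin-1 instead accepts every byte value, so the
# round trip returns the UTF-8 bytes unchanged.
round_tripped = utf8_bytes.decode("latin-1").encode("latin-1")

print(failure.start, hex(utf8_bytes[failure.start]))
print(round_tripped == utf8_bytes)
```

Under that assumption, the explicit decode('latin-1') in the rigmarole
works because it replaces the failing ASCII decode with one that
cannot fail.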

Perhaps this behaviour is an artifact of the CPython implementation?

Jython, by contrast, passes it just fine (and correctly, IMHO):

Jython 2.2.1 on java1.4.2_15
Type "copyright", "credits" or "license" for more information.
>>> response = u"interferon-gamma (IFN-\u03b3) responses in cattle"
>>> response.encode('utf-8')
'interferon-gamma (IFN-\xCE\xB3) responses in cattle'
>>> response.encode('utf-8').encode('latin-1')
'interferon-gamma (IFN-\xCE\xB3) responses in cattle'
>>>