[Web-SIG] Unicode in Python 3

Sat Sep 19 12:40:48 CEST 2009

Hi,

I spent the last few hours figuring out which decisions Python made
in the standard library, to get a better understanding of unicode in
Python 3 and how it affects web applications.

Let's sum up the current state of encodings in the web world:

RFC 2616 specifies the header encoding as "latin1" (or iso-8859-1).  The
majority of header values are ASCII only; the exceptions, apart from
custom headers and stuff like the server name, are the cookie headers.
Cookie headers are problematic for other reasons as well because some
browsers (IE for example) have different ideas of cookies than others.

I've seen many people using utf-8 encoded cookie values, so it's pretty
common to have headers with values outside the latin1 range.

However, to remind everybody: latin1 maps every byte to exactly one code
point, so it can carry utf-8 encoded data (which is invalid as latin1
text) without loss of information if you do the encode/decode dance.
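
A minimal sketch of that dance, for illustration only.  The helper names
smuggle() and recover() are made up for this example and are not part of
any library:

```python
def smuggle(raw):
    """Decode arbitrary bytes as latin1; each byte becomes one code point."""
    return raw.decode("latin1")


def recover(text, charset="utf-8"):
    """Undo smuggle() and decode the original bytes with the real charset."""
    return text.encode("latin1").decode(charset)


# A utf-8 encoded cookie value as it would arrive on the wire.
cookie = "snowman=\u2603".encode("utf-8")

# What a latin1-decoding server would hand to the application (mojibake).
as_header = smuggle(cookie)

# The original bytes are still fully recoverable.
assert as_header.encode("latin1") == cookie
print(recover(as_header))
```

The round trip only works because latin1 never rejects a byte; doing the
same with ASCII or utf-8 as the carrier encoding would raise or lose data.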

For URIs/IRIs there is a bit of a problem as well.  URLs are
encodingless but limited to ASCII.  Values outside of the ASCII range
have to be %-encoded, but the charset is specified nowhere.  Browsers
changed their URL encoding behavior to utf-8 a few years ago (I think
Mozilla changed it with Firefox 1.5 or Firefox 2).  They still try
latin1 as well if they are totally clueless and get a 404 or something.
I'm not exactly sure how that is supposed to work.
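
To illustrate the ambiguity: the same character quotes to different
bytes depending on the charset, and the receiving side can only guess.
The following is a rough sketch of a "try utf-8, fall back to latin1"
strategy on the server side; decode_path() is a hypothetical helper and
not what any particular browser or framework does:

```python
from urllib.parse import quote, unquote_to_bytes

# The same character, two different %-encodings.
print(quote("\xf6", encoding="latin1"))  # latin1 bytes, quoted
print(quote("\xf6", encoding="utf-8"))   # utf-8 bytes, quoted


def decode_path(raw):
    """Unquote a %-encoded path and guess its charset.

    Assumption: utf-8 is tried first; anything that is not valid utf-8
    is treated as latin1, which never fails.
    """
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin1")
```

The guess can of course be wrong for latin1 data that happens to be
valid utf-8, which is exactly why the missing charset is a problem.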

The new thing is IRIs.  They can contain non-ASCII characters and are
considered to be UTF-8.  It is possible to quote utf-8 encoded code
points with %-encoding.  IRIs might also contain unicode identifiers
for the hostname; for URIs this appears to be idna/punycode encoded.  E.g.:

   IRI: http://üser:pässword@☃.net/påth
   URI: http://%C3%BCser:p%C3%A4ssword@xn--n3h.net/p%C3%A5th

There are already Python implementations to convert between URIs and
IRIs (for example in Werkzeug 0.6).
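
For illustration, here is a rough standard-library-only sketch of the
IRI to URI direction.  This is not Werkzeug's implementation;
iri_to_uri() is a name made up for this example, it assumes the IRI has
a host part, and it ignores many corner cases (IPv6 literals, for one):

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (simplified sketch)."""
    parts = urlsplit(iri)

    # Hostname: IDNA (punycode) encoding instead of %-encoding.
    netloc = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += ":%d" % parts.port

    # Credentials: utf-8 bytes, %-encoded.
    if parts.username is not None:
        auth = quote(parts.username, safe="")
        if parts.password is not None:
            auth += ":" + quote(parts.password, safe="")
        netloc = auth + "@" + netloc

    # Path, query and fragment: utf-8 bytes, %-encoded.  Existing
    # escapes and reserved delimiters are left untouched.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="/?:@!$&'()*+,;=%")
    fragment = quote(parts.fragment, safe="/?:@!$&'()*+,;=%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://\xfcser:p\xe4ssword@\u2603.net/p\xe5th"))
```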

Form data: Form data is encoded by all browsers in the charset of the
page that contains the form.  However, if the encoding declaration is
missing from the HTTP header, the browser runs a character set guessing
algorithm.  This algorithm is currently browser dependent but might be
specified as part of HTML5.  At least there is a section in the draft
currently.
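
In practice this means the application has to remember which charset it
served the page in and decode the submission with the same one.  A small
sketch, assuming a urlencoded body and a urllib.parse.parse_qs that
accepts the encoding and errors arguments; parse_form() is a
hypothetical helper:

```python
from urllib.parse import parse_qs


def parse_form(body, charset="utf-8"):
    """Parse an application/x-www-form-urlencoded body given as bytes.

    Assumption: charset is the encoding the form page was served in.
    Undecodable input is replaced rather than raising an error.
    """
    return parse_qs(
        body.decode("ascii"),      # the urlencoded body itself is ASCII
        keep_blank_values=True,
        encoding=charset,
        errors="replace",
    )


print(parse_form(b"name=f%C3%B6%C3%B6"))
print(parse_form(b"name=f%F6%F6", charset="latin1"))
```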

This is a lot of charsets.  So for most applications the charsets look
like this:

   page encoding: utf-8
   headers: invalid latin1 with utf-8 payload
   form submissions: utf-8
   urls: utf-8

This is also the only configuration that looks reasonable; all the
others fall back to utf-8 on modern browsers every once in a while (for
example, if an IRI pointing to an external resource is used in an HTML
document, the browser will try utf-8 for the URL, even if that URL is
in fact latin1).

For Python 3, the standard library took the safe path and chose utf-8
as the standard encoding for URLs.  The biggest grief I have with this
is that URLs have to be 'str' in Python 3 (remember, that's unicode).
This works and is probably a step in a better direction, but I would
welcome the addition of an IRI module and would advertise the use of
IRIs internally.  (For the 'bytes' problems see further below)

Another place where the standard library decided to go with unicode
instead of bytes is the HTTP server and clients.  There Python assumes
latin1 for headers (which is correct on paper).

Unfortunately that complicates things a lot.  Graham is right in
mentioning that operating on bytes in Python 3 is a lot harder than it
was in Python 2.  And I'm not even talking about the missing implicit
conversion, but about missing functionality on the bytes type.

Here are some common idioms found in low-level WSGI code that no longer
work:

String formatting:

  >>> b"%d %s" % (200, "OK")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: unsupported operand type(s) for %: 'bytes' and 'tuple'

Integer to ASCII:

  >>> bytes(8)
  b'\x00\x00\x00\x00\x00\x00\x00\x00'
  >>> bytes(str(8))
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: string argument without an encoding
  >>> str(8).encode("ascii")
  b'8'
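
One possible workaround for both idioms is to do the formatting on
'str' and encode at the last moment.  This is only a sketch of that
pattern; status_line() and header_line() are made-up helper names:

```python
def status_line(code, reason):
    """Format a status such as b'200 OK' via str, then encode."""
    return ("%d %s" % (code, reason)).encode("ascii")


def header_line(name, value):
    """Format one header line; value may be an int such as a length."""
    return ("%s: %s\r\n" % (name, value)).encode("latin1")


body = b"Hello"
print(status_line(200, "OK"))
print(header_line("Content-Length", len(body)))
```

It works, but every call site now has to pick an encoding explicitly,
which is the kind of extra ceremony the Python 2 idioms did not need.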

urllib.parse appears to be buggy with bytestrings:

  >>> parse.quote_plus('föö'.encode('utf-8'))
  'f%C3%B6%C3%B6'
  >>> parse.unquote_plus('f%C3%B6%C3%B6')
  'föö'
  >>> parse.unquote_plus(b'f%C3%B6%C3%B6')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "C:\python31\lib\urllib\parse.py", line 404, in unquote_plus
      string = string.replace('+', ' ')
  TypeError: expected an object with the buffer interface

I'm pretty sure the latter is a bug and I will file one, however if
there is broken behavior with bytestrings in Python 3.1 that's another
thing we have to keep in mind.
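
Until that is sorted out, a bytes-in/bytes-out variant can be sketched
on top of urllib.parse.unquote_to_bytes, assuming that function is
available in the Python version at hand.  unquote_plus_bytes() is a
hypothetical helper, not a standard library function:

```python
from urllib.parse import unquote_to_bytes


def unquote_plus_bytes(data):
    """Like unquote_plus(), but accepts bytes and returns raw bytes."""
    if isinstance(data, str):
        data = data.encode("ascii")
    # '+' means space in form encoding; translate before unquoting so
    # that a quoted '%2B' still comes out as a literal plus sign.
    return unquote_to_bytes(data.replace(b"+", b" "))


raw = unquote_plus_bytes(b"f%C3%B6%C3%B6+bar")
print(raw.decode("utf-8"))
```

Returning bytes leaves the charset decision to the caller, which is the
point of wanting a bytes-based interface in the first place.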

Form data handling in Python 3 based on cgi.FieldStorage currently also
assumes unicode strings and from what I've read so far, it doesn't work
in Python 3.1, but I have not confirmed that.

In my opinion it was a mistake to force the unicode behavior on these
parts of the standard library, but it has happened now and that affects
the WSGI specification as well.

Based on what I've read in the code so far, I'm pretty sure we have to
find some statistics about how many non utf-8 applications still exist
in the wild and if we have use cases where the raw bytes are necessary.

Unfortunately the bytes approach does not sound that easy to implement
any more, given that the standard library no longer supports bytes for
many lower level operations and that the bytes object does not provide
any sort of string formatting.

However, that does not make the unicode approach any less evil.  Unless
we find a way to support unicode properly, one that does not lose
information and that makes ports of applications possible, I'm strongly
against it.

Regards,
Armin