[Python-ideas] Fall back to encoding unicode strings in utf-8 if latin-1 fails in http.client

Thu Jan 7 04:20:35 EST 2016

Hi,

I hope python-ideas is the right place to post this. I'm very new to 
this and would appreciate a pointer in the right direction if it is not.

The requests project is getting multiple bug reports about a problem in 
the stdlib http.client, so I thought I'd raise an issue about it here. 
The reports come from people posting HTTP requests with unicode strings 
as the body when they should be passing utf-8 encoded bytes.

Since RFC 2616 says latin-1 is the default encoding, http.client tries 
to encode the body with it and fails with a UnicodeEncodeError whenever 
the string contains characters outside latin-1.

My idea is NOT to change from latin-1 to something else, since that 
would break compliance with the spec. Instead, catch that exception and 
try encoding with utf-8. That would avoid breaking backward 
compatibility unless someone specifically relied on that exception, 
which I think is very unlikely.
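
A minimal sketch of the fallback described above. The helper name 
encode_body is hypothetical and is not an existing http.client 
function; it only illustrates the try-latin-1-then-utf-8 order:

```python
def encode_body(body):
    """Hypothetical helper, not part of http.client.

    Encode a str body as latin-1 (the current behaviour) and fall
    back to utf-8 only when latin-1 cannot represent the text.
    """
    if isinstance(body, bytes):
        # Already encoded by the caller; leave untouched.
        return body
    try:
        return body.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Proposed fallback instead of propagating the error.
        return body.encode("utf-8")
```

Strings that latin-1 can represent are encoded exactly as they are 
today, so only the inputs that currently raise would change behaviour.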

This is also how HTTP libraries in other languages seem to deal with 
this; sending in unicode just works:

In cURL (works fine):
curl http://example.com -d "Celebrate 🎉"

In Ruby with http.rb (works fine):
require 'http'
r = HTTP.post("http://example.com", :body => "Celebrate 🎉")

In Node with request (works fine):
var request = require('request');
request.post({url: 'http://example.com', body: "Celebrate 🎉"},
    function (error, response, body) {
        console.log(body)
    })

But Python 3 with requests crashes instead:
import requests
r = requests.post("http://localhost:8000/tag", data="Celebrate 🎉")

...with the following stacktrace:
...
   File "../lib/python3.4/http/client.py", line 1127, in _send_request
     body = body.encode('iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 
14-15: ordinal not in range(256)
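
The failure can be reproduced without any network access, since it 
comes purely from the encode call. The sketch below also shows the 
workaround of encoding explicitly; the variable names are only 
illustrative:

```python
body = "Celebrate 🎉"

# Same encode call that appears in the traceback above.
try:
    body.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Workaround: encode explicitly and pass the bytes as the request body,
# e.g. requests.post(url, data=encoded).
encoded = body.encode("utf-8")
```

Passing bytes sidesteps the latin-1 step entirely, but it is exactly 
the step beginners do not know they need.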

----

So the rationale for this idea is:

* http.client doesn't work the way beginners expect for very basic 
use cases (posting unicode strings).
* Libraries in other languages behave the way beginners expect, which 
magnifies the problem.
* Changing the default latin-1 encoding probably isn't possible, because 
it would break the spec...
* But catching the exception and trying to encode in utf-8 instead 
wouldn't break the spec and would solve the problem.

----

Here are a few issues where people expect things to work differently:

https://github.com/kennethreitz/requests/issues/1926
https://github.com/kennethreitz/requests/issues/2838
https://github.com/kennethreitz/requests/issues/1822

----

Does this make sense?

/Emil