[Python-ideas] Fall back to encoding unicode strings in utf-8 if latin-1 fails in http.client

Thu Jan 7 11:32:45 EST 2016

Thanks especially to Cory for digging into the source and the RFCs here!

Personally I'm perplexed that Requests, which claims to be "HTTP for
Humans", doesn't take care of this but just lets http/client.py blow up.
(However, IIUC both 2838 and 1822 are about the body.encode() call in
Python 3's http/client.py at _send_request(). 1926 seems to originate in
Requests itself; it's also Python 2.7.)

Anyways, if we were to follow the Python 3 philosophy regarding Unicode to
the letter we would have to reject the str type altogether here, and insist
on bytes. The error message could tell the caller what to do, e.g. "use
data.encode('utf-8') if you want the data to be encoded in UTF-8". (Then of
course the server might not like it.)
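
A minimal sketch of what that strict check might look like. The helper
name require_bytes is hypothetical and is not part of http.client; only
the wording of the error message follows the suggestion above.

```python
def require_bytes(body):
    """Hypothetical helper: accept bytes-like bodies, reject str outright."""
    if isinstance(body, str):
        # Tell the caller exactly what to do instead of guessing an encoding.
        raise TypeError(
            "body must be bytes, not str; use data.encode('utf-8') "
            "if you want the data to be encoded in UTF-8"
        )
    return body
```

Under this approach every caller that currently passes a str body would
have to change, which is where the backwards-compatibility cost comes from.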

An alternative could be to look at the Content-Type header (if one is
given) and use the charset from there or the default from the RFC for the
content type.
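
Roughly, the header-based alternative could be sketched as follows. The
function names are hypothetical, the headers are assumed to be a plain
dict, and the iso-8859-1 fallback is an assumption standing in for
whatever default the relevant RFC gives for the content type.

```python
from email.message import Message


def charset_from_headers(headers, default="iso-8859-1"):
    """Hypothetical helper: pull the charset parameter out of Content-Type."""
    content_type = headers.get("Content-Type")
    if content_type is None:
        return default
    # Let the email package do the parameter parsing (quoting, case, etc.).
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def encode_body_from_headers(body, headers):
    """Encode a str body using the declared charset; pass bytes through."""
    if isinstance(body, str):
        return body.encode(charset_from_headers(headers))
    return body
```

A body sent with "text/plain; charset=utf-8" would then be encoded as
UTF-8, while a request without a charset would keep the assumed default.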

But all these are rather painfully backwards incompatible, which is a big
concern here.

Maybe the best solution (most backward compatible *and* most likely to stem
the flood of bug reports) is to just catch the UnicodeError and replace its
message with something more Human-friendly, explaining that the data must
be encoded before sending it. Then the user can figure out what encoding to
use (though yes, most likely UTF-8 is it, so the message could suggest
trying that first).
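
As an illustration only, the friendlier error could be produced along
these lines. The function name encode_body_or_explain and the exact
message text are hypothetical; the iso-8859-1 default mirrors the latin-1
encoding mentioned in the subject line.

```python
def encode_body_or_explain(body, encoding="iso-8859-1"):
    """Hypothetical helper: encode a str body, with a clearer error on failure."""
    if not isinstance(body, str):
        return body
    try:
        return body.encode(encoding)
    except UnicodeEncodeError as err:
        # Re-raise the same exception type so existing except clauses
        # keep working; only the reason text changes.
        raise UnicodeEncodeError(
            err.encoding,
            err.object,
            err.start,
            err.end,
            "body is not encodable as %s; encode it before sending, "
            "e.g. body.encode('utf-8')" % encoding,
        ) from None
```

Because the exception type is unchanged, code that already catches
UnicodeError behaves as before; only the message the user sees differs.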

-- 
--Guido van Rossum (python.org/~guido)