[Python-ideas] Fall back to encoding unicode strings in utf-8 if latin-1 fails in http.client

Thu Jan 7 06:37:39 EST 2016

On 7 January 2016 at 09:20, Emil Stenström <em at kth.se> wrote:

> This is also how other languages http libraries seem to deal with this,
> sending in unicode just works:
>
> In cURL (works fine):
> curl http://example.com -d "Celebrate 🎉"
>

In a Unix shell, this supplies a bytestring argument to the curl
executable, with the characters encoded according to whatever locale the
user has configured (likely UTF-8).
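
As a rough illustration of that point (a sketch only; the second encoding
is an arbitrary choice for comparison, not something curl or any shell
documents), the same text becomes different bytes depending on the
encoding in effect:

```python
# The text from the examples, with the emoji written as an escape.
text = "Celebrate \U0001F389"  # U+1F389 PARTY POPPER

# Bytes a UTF-8 locale would hand to the program.
utf8_bytes = text.encode("utf-8")

# Bytes under a different (assumed) encoding, for comparison.
utf16_bytes = text.encode("utf-16-le")

print(utf8_bytes)
print(len(text), len(utf8_bytes), len(utf16_bytes))
```

The program on the receiving end only ever sees one of these byte
sequences, never the abstract characters.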

In Windows PowerShell (the only Windows shell I can think of that would
support Unicode), what happens would depend on how curl accesses its
command line, which probably depends on which specific C runtime (CRT) the
code was built with.

> In Ruby with http.rb (works fine):
> require 'http'
> r = HTTP.post("http://example.com", :body => "Celebrate 🎉")
>

I don't know how Ruby handles Unicode, but would that body argument
*actually* be Unicode, or would it be a UTF-8 encoded bytestring? I have a
vague recollection that Ruby uses a "utf-8 for internal string encodings"
model, which may mean it's not as strict as Python 3 is about separating
bytestrings and Unicode strings...

> In Node with request (works fine):
> var request = require('request');
> request.post({url: 'http://example.com', body: "Celebrate 🎉"}, function
> (error, response, body) {
>     console.log(body)
> })
>

Same response here as for Ruby: what's happening depends on the language's
semantics regarding Unicode support.

> But Python 3 with requests crashes instead:
> import requests
> r = requests.post("http://localhost:8000/tag", data="Celebrate 🎉")
> ...with the following stacktrace:
> ...
>   File "../lib/python3.4/http/client.py", line 1127, in _send_request
>     body = body.encode('iso-8859-1')
> UnicodeEncodeError: 'latin-1' codec can't encode characters in position
> 14-15: ordinal not in range(256)

What does the requests documentation say it'll do with a Unicode string
passed as POST data to a request where no encoding is specified? If it
says it'll encode as latin-1, then that error is entirely correct. If it
says it'll encode in some other encoding, then it isn't doing so (and
that's a requests bug). If it's not explaining what it's doing, then the
requests documentation is doing its users a disservice by not explaining
the realities of sending Unicode over a byte-oriented protocol - and it's
also leaving a huge "undefined behaviour" hole that people are falling into.
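
For reference, the failure in the traceback can be reproduced without any
network access. This is a minimal sketch using only str.encode; the sample
strings are illustrative, and it makes no claim about what requests does
internally:

```python
emoji_text = "Celebrate \U0001F389"  # U+1F389 PARTY POPPER
latin1_text = "caf\u00e9"            # every character fits in Latin-1

# Text that fits in Latin-1 is encoded silently, one byte per character.
assert latin1_text.encode("iso-8859-1") == b"caf\xe9"

# Text outside Latin-1 raises the same kind of error as in the traceback.
error_message = None
try:
    emoji_text.encode("iso-8859-1")
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)
```

The first case is the silent assumption; the second is where that
assumption stops working.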

I understand that beginners are confused by the apparent problem that other
environments "just work", but they really don't - and the problems will hit
the user further down the line, when the issue is harder to debug. For
example, you're completely ignoring the potential issue of what the target
server will do when faced with UTF-8 data - there's no guarantee that it
will work in general.
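
A small sketch of that failure mode, assuming a hypothetical server that
decodes the request body as Latin-1 (the sample text is illustrative):

```python
original = "caf\u00e9"  # "cafe" with an acute accent on the e

# The client encodes as UTF-8 without telling anyone.
sent = original.encode("utf-8")

# The (assumed) server decodes the same bytes as Latin-1.
received = sent.decode("iso-8859-1")

# No exception is raised, but the text has changed.
print(received == original)
print(len(original), len(received))
```

Nothing fails at the point of sending; the damage only shows up later,
when someone reads the stored text.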

So IMO, this needs to be addressed as a documentation (and possibly code)
fix in requests. It's something of a shame that http.client doesn't
reject Unicode strings rather than making a silent assumption of the
encoding, but that's something we have to live with for backward
compatibility reasons. But there's no reason requests has to expose that
behaviour to the user.
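
A minimal sketch of what refusing to guess could look like. Note that
prepare_body is a hypothetical helper written for illustration; it is not
part of requests or http.client, and its name and signature are
assumptions:

```python
def prepare_body(body, encoding=None):
    """Return bytes for an HTTP body, refusing to guess an encoding for str.

    Hypothetical helper, not an API of requests or http.client.
    """
    if isinstance(body, bytes):
        # Bytes are already unambiguous; pass them through untouched.
        return body
    if isinstance(body, str):
        if encoding is None:
            raise TypeError("str body requires an explicit encoding")
        return body.encode(encoding)
    raise TypeError("body must be bytes or str")


# The caller states the encoding, so nothing is assumed silently.
payload = prepare_body("Celebrate \U0001F389", encoding="utf-8")
print(payload)
```

With this shape, the caller would also be in a position to declare the
same encoding in the Content-Type header, so the server is not left
guessing either.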

Paul