[Python-ideas] Fall back to encoding unicode strings in utf-8 if latin-1 fails in http.client

Thu Jan 7 05:07:41 EST 2016

> On 7 Jan 2016, at 09:20, Emil Stenström <em at kth.se> wrote:
> 
> Since RFC 2616 says latin-1 is the default encoding http.client tries that and fails with a UnicodeEncodeError.

I cannot stress this enough: there is *no* default encoding for HTTP bodies!

This conversation is very confused, and it all starts because of a thoroughly misleading comment in http.client.

Firstly, let’s all remember that RFC 2616 is dead (hurrah!), now superseded by RFCs 7230 through 7235. However, http.client blames its decision on RFC 2616. Note the comment here[0]. This is (in my view) a *misreading* of RFC 2616 Section 3.7.1, which says:

> When no explicit charset
> parameter is provided by the sender, media subtypes of the "text"
> type are defined to have a default charset value of "ISO-8859-1" when
> received via HTTP.

The thing is, this paragraph is referring to MIME types: that is, when the Content-Type header reads “text/<something>”, and specifies no charset parameter, the body should be encoded in ISO-8859-1 (Latin-1).

That, of course, is not the invariant this code enforces. Instead, this code spots the *only* explicit reference to a text encoding and chooses to use it for any unicode string sent by the user. That’s a somewhat defensible decision, though it’s not the one I’d have made.
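
As a rough illustration of what that decision means in practice, here is a minimal sketch that does not touch http.client at all. It only assumes, per the comment referenced above, that a unicode body is encoded as ISO-8859-1; the helper name encode_body_like_http_client is hypothetical and not part of the library.

```python
def encode_body_like_http_client(body):
    """Hypothetical helper: mimic encoding a str body as ISO-8859-1.

    This is an assumption-based stand-in, not http.client's actual code.
    """
    if isinstance(body, str):
        return body.encode("iso-8859-1")
    return body  # bytes pass through untouched


# Every character here exists in Latin-1, so encoding succeeds.
latin1_ok = encode_body_like_http_client("caf\u00e9")

# U+2603 (snowman) has no Latin-1 representation, so this raises.
try:
    encode_body_like_http_client("snowman \u2603")
    failed = False
except UnicodeEncodeError:
    failed = True
```

The second call is the UnicodeEncodeError described at the top of this thread: the string is perfectly valid, it just cannot be expressed in the one encoding the code picked.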

*However*, that fallback was removed in RFC 7231. In appendix B of that RFC, we see this note:

> The default charset of ISO-8859-1 for text media types has been
> removed; the default is now whatever the media type definition says.
> Likewise, special treatment of ISO-8859-1 has been removed from the
> Accept-Charset header field.

This means there is no longer a default content encoding for HTTP, and instead the default encoding varies based on media type. The relevant RFC for this is RFC 6657, which specifies the following things:

- The default encoding for text/plain is US-ASCII
- All other text subtypes either MUST provide a charset parameter that explicitly indicates what their encoding is, or MUST NOT provide one under any circumstances and instead carry that information in their contents (e.g. HTML, XML). That is to say, there are no defaults for text/* encodings: only explicit encoding choices!
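
To make those two rules concrete, here is a small sketch of a charset lookup under the RFC 6657 reading given above. The function name charset_for is made up for illustration, and the parameter parsing is deliberately simplified (it does not handle quoted values containing semicolons), so treat it as a sketch rather than a real header parser.

```python
def charset_for(content_type):
    """Return the charset implied by a Content-Type value, or None.

    Hypothetical helper. None means "no default exists": the sender has
    to say explicitly, or the payload carries the information itself.
    """
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()

    # An explicit charset parameter always wins.
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"').lower()

    # The only default described above: text/plain is US-ASCII.
    if media_type == "text/plain":
        return "us-ascii"

    # Everything else, including other text/* subtypes, has no default.
    return None
```

Note that there is no branch anywhere in this that returns Latin-1.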

This whole thing was really very confusing from the beginning. IMO, the only safe decision is for http.client to simply refuse to accept unicode strings *at all* as request bodies: the ambiguity over what they mean is simply too great. Requests has had a large number of bug reports from people who claimed that something “didn’t work”, when in practice there was just a disagreement over what the correct encoding of something was. And having written both an HTTP/1.1 and an HTTP/2 client myself, in both cases I restricted the arguments of HTTPConnection.send() to bytestrings.
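
For the sake of illustration, a bytes-only restriction could look roughly like the sketch below. The name require_bytes and the error message are hypothetical; this is not how http.client or any of the clients mentioned above actually implement it.

```python
def require_bytes(body):
    """Hypothetical guard: accept bytes-like bodies, reject everything else."""
    if isinstance(body, (bytes, bytearray, memoryview)):
        return body
    if isinstance(body, str):
        # Refuse to guess: the caller knows the right encoding, we do not.
        raise TypeError(
            "request body must be bytes; encode it explicitly, "
            "e.g. body.encode('utf-8')"
        )
    raise TypeError("unsupported body type: %s" % type(body).__name__)
```

A caller would then write something like conn.send(require_bytes(data)), and the encoding decision stays with the person who knows what the server expects.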

For what it’s worth, I don’t believe it’s a good idea to change the default body encoding of unicode strings. This is the kind of really perplexing change that takes working code that implicitly relies on this behaviour and breaks it. In my experience, breakage of this kind is particularly tricky to catch because anything that can be validly encoded as Latin-1 can be validly encoded as UTF-8, so the breakage will show up as failed requests rather than tracebacks. In this instance I believe the http.client module has made its bed, and will need to lie in it.
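
A short sketch of why this fails silently: the same string encodes without error under both codecs, but to different bytes. The variable names here are only for illustration.

```python
text = "caf\u00e9"  # "café": every character fits in Latin-1

as_latin1 = text.encode("latin-1")  # one byte for the accented e
as_utf8 = text.encode("utf-8")      # two bytes for the accented e

# A server that still expects Latin-1 decodes the UTF-8 bytes without
# complaint and ends up with the wrong text instead of an error.
misread = as_utf8.decode("latin-1")
```

Neither encode call raises, so nothing on the client side points at the encoding change; the only symptom is a server that received something other than what it expected.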

If this *did* change, Requests would (at least for the remainder of the 2.X release cycle) need to enforce the Latin-1 behaviour itself for the very same backward compatibility reasons, which removes any benefit we’d get from this anyway.

The really correct behaviour would be to tell users they cannot send unicode strings, because it makes no sense. That’s a change I could get behind. But moving from one guess to another, even though the new guess is more likely to be right, seems to me to be misunderstanding the problem.

Cory

N.B.: I should note that only one of the linked Requests issues, #2838, is actually about the request body. Of the others, one is about unicode in the request URI and one is about unicode in header values. This set of related issues demonstrates an ongoing confusion amongst users about what unicode strings are and how they work, but that’s a separate discussion to this one.

[0]: https://github.com/python/cpython/blob/master/Lib/http/client.py#L1173-L1176