[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

Fri May 20 05:34:40 EDT 2016

Ben Spiller added the comment:

Thanks for considering this, anyway. I'll admit I'm disappointed we couldn't fix this on the 2.7 train. To me, a method that takes an errors='ignore' argument and then throws an exception anyway seems more like a bug than a feature, and changing it would be unlikely to affect behaviour in any existing non-broken programs. But if that's the decision, then fine. Of course I'm aware (as I mentioned earlier on the thread) that the radically different unicode handling in Python 3 solves this entirely. I only wish it were practical to move our existing (enormous) codebase and customers over to it, but we're stuck with Python 2.7, and I believe lots of people are unfortunately in the same situation.

As Josh suggested, perhaps we can at least add something to the doc for the str/unicode encode and decode methods so users are aware of the behaviour without trial and error. I'll update the component of this bug to reflect it's now considered a doc issue. 

Based on the input from Terry, and what seems to be the key info that would have been helpful to me and to those who are hitting the same issues for the first time, I'd propose the following text (feel free to adjust as you see fit):

For encode:
"For most encodings, the return type is a byte str regardless of whether it is called on a str or unicode object. For example, call encode on a unicode object with "utf-8" to return a byte str object, or call encode on a str object with "base64" to return a base64-encoded str object.

It is _not_ recommended to call this method on "str" objects when using codecs such as utf-8 that convert between str and unicode objects, as any characters not supported by Python's default encoding (usually 7-bit ascii) will result in a UnicodeDecodeError exception, even if errors='ignore' was specified. For such conversions the str.decode and unicode.encode methods should be used. If you need to produce an encoded version of a string that could be either a str or unicode object, only call the encode() method after checking it is a unicode object not a str object, using isinstance(s, unicode)."
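
To illustrate the encode advice, here is a minimal sketch of the isinstance check. The names to_bytes and emulate_py2_str_encode are hypothetical helpers, not part of the standard library, and text_type is just an alias so the snippet also runs on Python 3. The second function is only an approximation of what 2.7 appears to do when encode is called on a byte str, assuming the default encoding is ascii: the errors argument reaches the explicit encode step but not the implicit decode before it.

```python
# Sketch only: helper names are hypothetical, not stdlib APIs.
text_type = type(u"")  # unicode on Python 2.7, str on Python 3


def to_bytes(s, encoding="utf-8", errors="strict"):
    """Return s as a byte string, encoding it only if it is text."""
    if isinstance(s, text_type):
        return s.encode(encoding, errors)
    # Already a byte string: leave it alone rather than calling .encode(),
    # which on 2.7 would first decode it with the default encoding.
    return s


def emulate_py2_str_encode(raw, encoding="utf-8", errors="strict"):
    """Approximate 2.7's str.encode: implicit strict ASCII decode, then encode.

    Assumes the default encoding is ascii. Note that `errors` is applied
    only to the encode step, so errors='ignore' cannot suppress a failure
    in the implicit decode.
    """
    return raw.decode("ascii").encode(encoding, errors)


if __name__ == "__main__":
    print(repr(to_bytes(u"caf\xe9")))        # text is encoded to UTF-8 bytes
    print(repr(to_bytes(b"caf\xc3\xa9")))    # bytes are returned unchanged
    try:
        emulate_py2_str_encode(b"caf\xc3\xa9", "utf-8", "ignore")
    except UnicodeDecodeError as exc:
        print("raised despite errors='ignore': %s" % exc)
```

Checking the type before calling encode keeps the errors argument meaningful, because the only conversion that happens is the one explicitly requested.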

and for decode:
"The return type may be either str or unicode, depending on which encoding is used and whether the method is called on a str or unicode object. For example, call decode on a str object with "utf-8" to return a unicode object, or call decode on a unicode or str object with "base64" to return a base64-decoded str object.

It is _not_ recommended to call this method on "unicode" objects when using codecs such as utf-8 that convert between str and unicode objects, as any characters not supported by Python's default encoding (usually 7-bit ascii) will result in a UnicodeEncodeError exception, even if errors='ignore' was specified. For such conversions the str.decode and unicode.encode methods should be used. If you need to produce a decoded version of a string that could be either a str or unicode object, only call the decode() method after checking it is a str object not a unicode object, using isinstance(s, str)."
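
The decode advice can be sketched the same way. The name to_text is a hypothetical helper, not a standard library function, and text_type is an alias chosen so the snippet runs on both 2.7 and Python 3. It only calls decode on byte strings, so the errors argument is honoured as the caller expects.

```python
# Sketch only: to_text is a hypothetical helper name, not a stdlib API.
text_type = type(u"")  # unicode on Python 2.7, str on Python 3


def to_text(s, encoding="utf-8", errors="strict"):
    """Return s as text, decoding it only if it is a byte string."""
    if isinstance(s, text_type):
        # Already text: calling .decode() here is what would trigger the
        # implicit default-encoding encode (and UnicodeEncodeError) on 2.7.
        return s
    return s.decode(encoding, errors)


if __name__ == "__main__":
    print(repr(to_text(b"caf\xc3\xa9")))                # valid UTF-8 bytes are decoded
    print(repr(to_text(u"caf\xe9")))                    # text is returned unchanged
    print(repr(to_text(b"caf\xe9", errors="ignore")))   # invalid byte is dropped
```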

----------
components: +Documentation -Interpreter Core

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue26369>
_______________________________________