Is setdefaultencoding bad?

Wed Feb 23 21:24:58 EST 2011

On Wed, 23 Feb 2011 04:14:29 -0800, Chris Rebert wrote:

>> Ok, but that the interface handles UTF-8 strings
>> are still ok? The defaultencoding is still ascii.
> 
> Yes, that's fine. UTF-8 is an excellent encoding choice, and
> encoding/decoding should always be done explicitly in Python, so the
> "default encoding" ideally ought to never come into play (and indeed,
> Python 3 does away with bug-prone implicit encoding/decoding entirely
> FWICT).

On Unix, you have to go out of your way to avoid the use of implicit
encoding/decoding with the "filesystem" encoding. This is because Unix
extensively uses byte strings with no associated encoding, but Python 3
tries to use Unicode for everything.

3.0 was essentially unusable as a Unix scripting language for this reason,
as argv and environ were converted to Unicode, with no possibility of
recovering from unconvertible sequences.

3.1 added the surrogate-escape mechanism which allows recovery of the
original byte sequences, albeit with some effort (i.e. you had to
explicitly encode os.environ and sys.argv back to bytes).
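
As a rough sketch of what that manual recovery looks like (recover_bytes is
a hypothetical helper name, and the UTF-8 codec and the sample byte string
are assumptions chosen for illustration):

```python
import sys

def recover_bytes(text, encoding=None):
    """Hypothetical helper: turn a surrogate-escaped str back into bytes."""
    if encoding is None:
        # Assumption: the string was decoded with the filesystem encoding,
        # as sys.argv and os.environ are on Unix.
        encoding = sys.getfilesystemencoding()
    return text.encode(encoding, "surrogateescape")

raw = b"caf\xe9"                                # Latin-1 bytes, invalid as UTF-8
text = raw.decode("utf-8", "surrogateescape")   # the bad byte becomes U+DCE9
recovered = recover_bytes(text, "utf-8")        # the original bytes come back
```

The same call applied to each element of sys.argv should give back the
bytes the process was actually started with.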

3.2 adds os.environb (bytes version of os.environ), but it appears that
sys.argv still has to be encoded manually. It also provides os.fsencode()
and os.fsdecode() to simplify the conversion.
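
A minimal sketch of the 3.2 approach, assuming a Unix system (argv_bytes is
a hypothetical helper name, and the file name and the HOME variable are just
examples):

```python
import os
import sys

def argv_bytes(args=None):
    """Hypothetical helper: a bytes view of the command-line arguments."""
    if args is None:
        args = sys.argv
    # os.fsencode() applies the filesystem encoding and, on Unix,
    # the surrogateescape error handler.
    return [os.fsencode(arg) for arg in args]

raw_name = b"caf\xe9.txt"            # not valid UTF-8
as_text = os.fsdecode(raw_name)      # str, possibly containing surrogates
round_trip = os.fsencode(as_text)    # expected to equal raw_name on Unix

# os.environb only exists where the native environment is bytes.
if os.supports_bytes_environ:
    home = os.environb.get(b"HOME")  # bytes, or None if HOME is unset
```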

Most functions accept bytes arguments, and most either return bytes when
passed bytes or (if the function takes no arguments) have a bytes
equivalent. But variables tend to be Unicode strings with no bytes version
(os.environb is the exception rather than the rule), and some functions
have no bytes equivalent (e.g. os.ctermid(), os.uname(), os.ttyname();
fortunately it's rather unlikely that the result from any of these
functions will contain non-ASCII characters).
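
To illustrate the pattern, here is a short sketch assuming a Unix system
with Python 3.2 or later (the choice of os.listdir(), os.getcwd() and
os.uname() as examples is mine):

```python
import os

# Bytes in, bytes out: the result type follows the argument type.
names_as_bytes = os.listdir(b".")    # list of bytes
names_as_text = os.listdir(".")      # list of str

# No arguments, so there is a separate bytes equivalent.
cwd_text = os.getcwd()               # str
cwd_bytes = os.getcwdb()             # bytes

# No bytes equivalent: encode the result by hand if bytes are needed.
node_bytes = os.fsencode(os.uname().nodename)
```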