[Python-ideas] Why decode()/encode() name is harmful

Mon Jun 1 23:56:46 CEST 2015

On Jun 1, 2015, at 08:46, anatoly techtonik <techtonik at gmail.com> wrote:
> 
>> On Sat, May 30, 2015 at 3:18 AM, Steven D'Aprano <steve at pearwood.info> wrote:
>> 
>> As far as I can see, he has been given the solution, or at least a
>> potential solution, on python-list, but as far as I can tell he either
>> hasn't read it, or doesn't like the solutions offered and so is
>> ignoring them.
> 
> Let me update you on this. There was no solution given. Only the
> pointers to go read some pointers on the internets again. So, yes,
> I read replies. But I have very little time to analyse and follow up.

Hold on. You had a question, you don't have time to read the answers you were given, so instead you think Python needs to change?

> The idea I wanted to convey in this thread is that encode/decode
> is confusing, so if you agree with that, I can start to propose
> alternatives.
> 
> And just to make you understand the importance of the question
> with translating from bytes to unicode and back, let me just tell
> that this question is the third one voted with 221k views on SO in
> Python 3 tag.

First, as multiple people including the OP say in the comments to that question, what's confusing to novices is that subprocess pipes are the first thing they've used that are binary by default instead of text by default. (For other novices that will instead happen with sockets. But it will eventually happen somewhere.) So, maybe the subprocess docs need a prominent link to, say, the Unicode HOWTO, which is what the OP of that question seems to be proposing. Or maybe it should just be easier to open subprocess pipes in text mode, as it is for files.
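To make the bytes-versus-text point concrete, here is a minimal sketch of the same pipe read both ways. The child command is made up for illustration (it just runs the current interpreter and prints one line); universal_newlines=True is the existing subprocess switch for text mode.

```python
import subprocess
import sys

# Hypothetical child process: this same interpreter printing one ASCII line.
cmd = [sys.executable, "-c", "print('hello')"]

# Default: the pipe is binary, so the result is a bytes object.
raw = subprocess.check_output(cmd)

# Text mode: subprocess decodes the output for you (using the locale's
# preferred encoding) and returns str.
text = subprocess.check_output(cmd, universal_newlines=True)

# Doing the same step by hand, given that this output is known to be ASCII.
decoded = raw.decode("ascii")
```

Either way a decode step happens; the only question is whether you name the encoding yourself or let subprocess pick the locale default.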

But I don't see how renaming the methods could possibly help anything. The problem is not that the OP saw the answer and didn't understand or believe it, it's that he didn't know how to search for it. When told the right answer, he immediately said "Thanks, that does it" not "Whatchootalkinbout Willis, I don't have any crypto here". I've never heard of anyone besides you having that reaction.

Also, your own answer there is a really bad idea. It was an intentional part of the design of UTF-8 that decoding non-UTF-8 non-ASCII text as if it were UTF-8 will almost always signal an error. It's not a good thing to silently get mojibake instead of getting an error--it just pushes the problem further down the line, to someone for whom it's harder to understand, find, and debug. In the worst case, it pushes the problem all the way to the end user, who's even less equipped to deal with it than you are when his Russian characters get turned into box graphics. If you have bytes and you want text, the only solution is to find out the encoding and decode with it. That's not a problem with Python, it's a problem with the proliferation of incompatible encodings that people have used without any in-band or out-of-band indications over the past few decades.
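A small sketch of the difference, assuming some Russian text that was written out as cp1251 (the sample string and the choice of Latin-1 as the "wrong but never failing" codec are mine, purely for illustration):

```python
russian = "Привет"
legacy_bytes = russian.encode("cp1251")  # pretend this arrived with no label

# Strict UTF-8 decoding notices the bytes are not UTF-8 and raises.
try:
    legacy_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Latin-1 maps every byte to some character, so it never fails; it just
# hands back the wrong text with no warning.
mojibake = legacy_bytes.decode("latin-1")

# Decoding with the encoding the bytes were actually written in recovers
# the original text.
recovered = legacy_bytes.decode("cp1251")
```

The strict decode fails at the point where the wrong guess was made; the Latin-1 decode succeeds and lets the garbage travel on.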

Of course there are cases where you want to smuggle bytes with text, or degrade as gracefully as possible on errors, or whatever. That's why decode takes an error handler. But in the usual case, if you try to interpret something as UTF-8 when it's really cp1252, or interpret something as Big5 when it's really Shift-JIS, or whatever, an error is exactly what you should hope for, to tell you that you guessed wrong. That's why it's the default.
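For the cases where you do want something other than an error, here is a rough sketch of the error-handler argument. The sample bytes are an assumption for illustration: "café" as a cp1252/Latin-1 program would have written it.

```python
data = b"caf\xe9"  # "café" in cp1252/Latin-1; not valid UTF-8

# Default handler ("strict"): a wrong guess raises.
try:
    data.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True

# Degrade gracefully: undecodable bytes become U+FFFD.
replaced = data.decode("utf-8", errors="replace")

# Smuggle the bytes through text: each undecodable byte becomes a lone
# surrogate, which the same handler turns back into the original byte
# when encoding.
smuggled = data.decode("utf-8", errors="surrogateescape")
round_trip = smuggled.encode("utf-8", errors="surrogateescape")
```

Both non-default handlers are opt-in, which is the point: you have to ask for lossy or smuggled results explicitly.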

> http://stackoverflow.com/questions/tagged/python-3.x
> 
> -- 
> anatoly t.