base64.b64encode(data)

Steven D'Aprano steve at pearwood.info
Mon Jun 13 06:35:33 EDT 2016


On Mon, 13 Jun 2016 03:33 pm, Random832 wrote:

> Why do you say these things like you assume I will agree with them.


Because I gave you the benefit of the doubt that you were a reasonable
person open to good-faith discussion, rather than playing John Cleese's
role in your own personal version of the Argument Sketch :-)

I don't mind if you say "Well I'm a telecommunications engineer, and when we
talk about text protocols, this is what I mean." If I ever find myself in a
forum of telco engineers, I'll learn to use their definition too.

But this is a Python forum, and Python 3 is a language that tries very, very
hard to keep a clean separation between bytes and text, where text is
understood to mean Unicode, not a subset of ASCII-encoded bytes. Python 2
was quite happy to let the two categories bleed into each other, with
disastrous effects.
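
To make that separation concrete, here is a minimal sketch of Python 3 refusing to mix the two types. The sample string and the helper name try_concat are my own illustrations, not anything from the standard library:

```python
# Python 3 keeps str (Unicode text) and bytes as distinct types.
text = "café"                 # 4 Unicode code points
data = text.encode("utf-8")   # 5 bytes: the "é" needs two bytes in UTF-8


def try_concat(a, b):
    """Hypothetical helper: return a + b, or None if the types refuse to mix."""
    try:
        return a + b
    except TypeError:
        return None


mixed = try_concat(data, text)      # bytes + str is rejected in Python 3
round_trip = data.decode("utf-8")   # the only way back to text is an explicit decode
```

In Python 2, the equivalent concatenation would have attempted an implicit ASCII conversion, which is where much of the trouble came from.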

When I first started using computers, the PC world assumed that "text" meant
an ASCII-compatible subset of bytes. One character = one byte, and 'A'
meant byte 0x41 (in hex; in decimal it would be 65). Most of our wire
protocols make that same assumption, and some older file formats (like
HTML) do the same. They're remnants from a bygone age where you could get
away with calling the sequence of bytes 

48 65 6C 6C 6F 20 57 6F 72 6C 64 21

"text", because everyone[1] agreed on the same interpretation of those
bytes, namely "Hello World!". But that's no longer the case, and hasn't
been for... well, to be honest, it was *never* the case that 0x48
unambiguously meant 'H', and it is certainly not the case now.
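
A rough sketch of the point, assuming Python 3 and its bundled cp037 codec (one of several EBCDIC code pages, picked here only as an example of a non-ASCII interpretation):

```python
# The same twelve bytes, read under two different encodings.
raw = bytes.fromhex("48 65 6C 6C 6F 20 57 6F 72 6C 64 21")

as_ascii = raw.decode("ascii")    # the interpretation "everyone" agreed on
as_ebcdic = raw.decode("cp037")   # what an EBCDIC reader would see instead

# 'A' is byte 0x41 (decimal 65) only once you have agreed on the encoding.
a_code_point = ord("A")

print(repr(as_ascii))
print(repr(as_ebcdic))
```

The bytes are identical in both cases; only the agreed interpretation changes what "text" they spell.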

The bottom line is that critical use-cases for base64 involve transmitting
bytes, not writing arbitrary Unicode, and that's why the base64 module is
treated as a bytes to bytes transformation in Python. You can argue with me
all you like, but the docs explicitly call it this:

https://docs.python.org/3/library/codecs.html#binary-transforms

and even in Python 2 it is called "str to str", where str is understood to
be bytes-string, not Unicode:

https://docs.python.org/2/library/codecs.html#standard-encodings
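
For what it's worth, here is a short sketch of how that plays out in Python 3. The sample payload and the variable names are mine; the choice of UTF-8 for the text-to-bytes step is an assumption you would make to suit your own data:

```python
import base64
import codecs

# Text has to be encoded to bytes *before* base64 ever sees it.
payload = "Hello World!".encode("utf-8")

encoded = base64.b64encode(payload)           # bytes in, bytes out
via_codec = codecs.encode(payload, "base64")  # the "binary transform" codec;
                                              # it appends a trailing newline

# Handing b64encode a str is an error rather than a silent guess.
try:
    base64.b64encode("Hello World!")
    str_rejected = False
except TypeError:
    str_rejected = True

# If you want the result as text, that is a separate, explicit decode.
as_text = encoded.decode("ascii")
```

Note that the encode-to-bytes and decode-to-text steps sit outside base64 itself, which is exactly the bytes-to-bytes contract the docs describe.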



And besides I've only paid for the ten minute argument.





[1] Apart from those guys using IBM mainframes. And people in foreign parts,
where they speak weird outlandish languages with bizarre characters, like
England. And users of earlier versions of ASCII, or users of variants of
ASCII that differ ever so slightly.


-- 
Steven



