base64.b64encode(data)

Mon Jun 13 01:16:05 EDT 2016

On Mon, 13 Jun 2016 01:20 pm, Random832 wrote:

> On Sun, Jun 12, 2016, at 22:22, Steven D'Aprano wrote:
>> That's because base64 is a bytes-to-bytes transformation. It has
>> nothing to do with unicode encodings.
> 
> Nonsense. base64 is a binary-to-text encoding scheme. The output range
> is specifically chosen to be safe to transmit in text protocols.

"Safe to transmit in text protocols" surely should mean "any Unicode code
point", since all of Unicode is text. What's so special about the base64
ones?

Well, that depends on your context. For somebody who cares about sending
bits over a physical wire, their idea of "text" is not Unicode, but a
subset of ASCII *bytes*.

The end result is that after you've base64ed your "binary" data to
get "text" data, what are you going to do with it? Treat it as Unicode code
points? Probably not. Squirt it down a wire as bytes? Almost certainly.
Looking at this from the high-level perspective of Python, that makes it
conceptually bytes not text.

Yes, I know that there's a terminology clash between communication engineers
and the programmers who work in their world, and the rest of us. We
use "text" to mean Unicode[1], they use "text" to mean "roughly 100 of the
128 bytes with the high-bit cleared, interpreted as ASCII".

But those folks are unlikely to be asking why base64 encoding a bunch of
bytes returns bytes. They *want* it to return bytes, because that's what
they're going to squirt down the wire. If you gave them Unicode, encoded
using (say) UTF-16 or UTF-32, they're likely to say "WTF are you giving me
this binary data for? Look at all these NUL bytes, what am I supposed to do
with them?!?!". (If they could cope with arbitrary bytes, they wouldn't
have base64 encoded it.) And if you gave them UTF-8, well, how would anyone
know? With base64 encoded data, it's all a subset of ASCII.
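
A rough sketch of that point, using a made-up base64 string (any other
would do): the UTF-8 bytes are identical to the ASCII bytes, while the
UTF-16 and UTF-32 forms are padded with NULs. The "-be" codec variants are
chosen here only so that no byte order mark is added.

```python
# An illustrative base64 string; the value itself is not important.
text = 'AUERFg=='

ascii_bytes = text.encode('ascii')
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-be')   # big-endian, no BOM
utf32_bytes = text.encode('utf-32-be')   # big-endian, no BOM

# For base64 output, UTF-8 and ASCII cannot be told apart.
print(utf8_bytes == ascii_bytes)

# The wider encodings add one or three NUL bytes per character.
print(len(utf16_bytes), utf16_bytes.count(b'\x00'))
print(len(utf32_bytes), utf32_bytes.count(b'\x00'))
```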

Python defines a nice clean separation between text (Unicode) and binary
data (bytes). Under that model, base64 is a transformation from
unrestricted bytes 0...255 to a restricted subset of bytes that matches
some ASCII encoded text. It shouldn't return a Unicode string, because
that's an abstract text format and we can't make any assumptions about the
implementation. Say you base64 encode some binary data:

py> base64.b64encode(b'\x01A\x11\x16')
b'AUERFg=='

Suppose instead it returned the Unicode string 'AUERFg=='. That's all well
and good, but what are you going to do with it? You can't transmit it over
a serial cable, because that almost surely is going to expect bytes, so you
have to encode it. You can't embed it in an email, because that also
expects bytes.
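
To make that concrete, here is a small sketch in which io.BytesIO stands in
for the serial cable or socket (my stand-in, not a real device). It accepts
the bytes that b64encode returns as they are, but refuses a str until you
encode it yourself.

```python
import base64
import io

encoded = base64.b64encode(b'\x01A\x11\x16')   # bytes, as b64encode returns
as_text = encoded.decode('ascii')              # what a str result would look like

wire = io.BytesIO()   # stand-in for a byte-oriented channel

# The bytes result can be written directly.
wire.write(encoded)

# A str cannot; the binary stream raises TypeError.
try:
    wire.write(as_text)
    accepted_str = True
except TypeError:
    accepted_str = False
print(accepted_str)

# It only goes through after an explicit encode step.
wire.write(as_text.encode('ascii'))
print(wire.getvalue())
```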

You could write it to a file. If the file is opened in binary mode, you have
to encode the Unicode string to bytes before you can write it. If the file
is opened in text mode, Python will accept your Unicode string and encode
it for you, which could introduce non-base64 characters into the file.
Consider if the file was opened using UTF-16:

\x00A\x00U\x00E\x00R\x00F\x00g\x00=\x00=

hardly counts as base64 in any meaningful sense.
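
The same thing as a runnable sketch. The file name is hypothetical, and I
open the file as 'utf-16-be' so that no byte order mark is written and the
bytes match the ones shown above; plain 'utf-16' would add a BOM and use the
platform's byte order.

```python
import base64
import binascii
import os
import tempfile

# Pretend b64encode had returned a str instead of bytes.
text = base64.b64encode(b'\x01A\x11\x16').decode('ascii')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'payload.txt')   # hypothetical file name
    # Text mode: Python encodes the str for us on the way out.
    with open(path, 'w', encoding='utf-16-be') as f:
        f.write(text)
    # Read back the raw bytes that actually landed on disk.
    with open(path, 'rb') as f:
        on_disk = f.read()

print(on_disk)

# A strict decoder rejects the NUL bytes as non-alphabet characters.
try:
    base64.b64decode(on_disk, validate=True)
    strict_ok = True
except binascii.Error:
    strict_ok = False
print(strict_ok)
```

Note that b64decode is lenient by default and discards characters outside
the alphabet, so it is the validate=True check that shows the difference.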

So while I completely accept your comment about "text protocols" in the
context of the networking world, we're not in the networking world. We're
in the high-level programming language world of Python, where text does not
mean a subset of ASCII bytes, it means Unicode. And in *our* world, having
base64 return text is a mistake.

[1] Or at least we should, since the idea that only American English[2]
counts as text cannot possibly survive in the 21st Century when we're
connected to the entire world of different languages. Although I'd allow
TRON as well, if you can actually find any TRON users outside of Japan.[3]

[2] And only a subset of American English at that.

[3] Or inside Japan for that matter.

-- 
Steven