base64.b64encode(data)

Steven D'Aprano steve at pearwood.info
Tue Jun 21 11:56:31 EDT 2016


On Mon, 13 Jun 2016 11:36 pm, Random832 wrote:

> On Mon, Jun 13, 2016, at 06:35, Steven D'Aprano wrote:
>> But this is a Python forum, and Python 3 is a language that tries
>> very, very hard to keep a clean separation between bytes and text,
> 
> Yes, but that doesn't mean that you're right 

As you already know, but others might not, I asked on the Python-Dev list
why b64encode has the behaviour it has:

https://mail.python.org/pipermail/python-dev/2016-June/145166.html

**Even if** your interpretation of RFC-989 etc is correct, Python is not
bound to follow their interpretation. The RFC is a network protocol, Python
is a programming language, and our libraries can do whatever makes sense
for *programming*. And the people who migrated the Python 2 base64 lib to
Python 3 thought that it made more sense to have the functions operate on
bytes and return bytes. Other languages have made other choices:

Microsoft's base64 library in C#, C++, F# and VB takes an array of bytes as
input, and outputs a UTF-16 string:

https://msdn.microsoft.com/en-us/library/dhx0d524%28v=vs.110%29.aspx

Java's base64 encoder takes and returns bytes:

https://docs.oracle.com/javase/8/docs/api/java/util/Base64.Encoder.html

Javascript's Base64 encoder takes input as UTF-16 encoded text and returns
the same:

https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding
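
For comparison, here is a minimal sketch of the Python 3 behaviour described
above (bytes in, bytes out). The variable names are only illustrative; the
calls are the standard library's base64.b64encode and base64.b64decode.

```python
import base64

data = b"hello"                      # input must be bytes-like, not str
encoded = base64.b64encode(data)     # result is bytes as well: b'aGVsbG8='

# If you want text, you decode explicitly; the output is pure ASCII.
text = encoded.decode("ascii")

# Passing a str straight to b64encode is rejected with TypeError.
try:
    base64.b64encode("hello")
except TypeError:
    str_rejected = True
else:
    str_rejected = False

# Round trip back to the original bytes.
roundtrip = base64.b64decode(encoded)
```

The explicit .decode("ascii") step is where the bytes/text separation shows
up: the library never guesses an encoding for you.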


RFC 989 says that its unnamed "Encode to Printable Form" uses
implementation-independent characters:

  The bits resulting from the encryption operation are encoded 
  into characters which are universally representable at all 
  sites, though not necessarily with the same bit patterns (e.g., 
  although the character "E" is represented in an ASCII-based 
  system as hexadecimal 45 and as hexadecimal C5 in an EBCDIC-based
  system, the local significance of the two representations is 
  equivalent).

https://tools.ietf.org/html/rfc989


But I'm not sure how RFC 989 intends this to work in practice. If you
encrypt and encode a message on an EBCDIC machine, and the output consists
of an "E" (i.e. 0xC5), and you transmit it to an ASCII machine where you try
to decode it, it will be interpreted as an eight-bit non-ASCII character,
*not* as "E". In order for this to work, you need an additional step that
translates byte 0xC5 (EBCDIC "E") into byte 0x45 (ASCII "E"), otherwise you
get junk.
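
A rough sketch of that extra translation step, using Python's codecs. I am
assuming the cp037 code page as a stand-in for "EBCDIC" here (there are
several EBCDIC variants, and RFC 989 does not name one), and the variable
names are made up for illustration.

```python
import base64

# The same character has different byte values in the two codesets.
e_ebcdic = "E".encode("cp037")   # b'\xc5' (cp037 is an assumed EBCDIC variant)
e_ascii = "E".encode("ascii")    # b'\x45', i.e. b'E'

# Base64 text as an ASCII machine would put it on the wire.
ascii_wire = base64.b64encode(b"hello")

# The same Base64 characters as an EBCDIC machine would store them.
ebcdic_wire = ascii_wire.decode("ascii").encode("cp037")

# The additional step: translate EBCDIC bytes to ASCII bytes before decoding.
translated = ebcdic_wire.decode("cp037").encode("ascii")
recovered = base64.b64decode(translated)
```

Without the translation, the ASCII side receives bytes that are not in the
Base64 alphabet at all.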

That's okay for email, since email is sent in US-ASCII[1], so any EBCDIC
machine wanting to send email must convert the header and bodies into
US-ASCII, including any Base64 attachments. But the relevance of this to
Python is pretty low.



> At
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uuencode.html

Python's base64 module is not a re-implementation of the POSIX utility
uuencode. The uuencode utility is an application, not a library. It has its
own reasons for writing text files encoded using the local environment's
default encoding, and it explicitly states that when moving such files to
another system, they must be translated:

  [quote]
  If it was transmitted over a mail system or sent to a machine with a
  different codeset, it is assumed that, as for every other text file, 
  some translation mechanism would convert it (by the time it reached a
  user on the other system) into an appropriate codeset.
  [end quote]

In any case, the POSIX utility uuencode is free to implement whatever
high-level behaviour its authors like, just as programming language
designers are free to design their Base64 libraries to work how they like.






[1] With a few exceptions, such as binary attachments, although not all mail
servers can deal with them.


-- 
Steven



