Pure Python Data Mangling or Encrypting

Steven D'Aprano steve at pearwood.info
Thu Jun 25 00:07:33 EDT 2015


On Thu, 25 Jun 2015 04:36 am, Randall Smith wrote:

> On 06/24/2015 07:19 AM, Dennis Lee Bieber wrote:
> 
> 
>> Pardon, but that description has me confused. Perhaps I just don't
>> understand the full use-case.
>>
>> Who exactly is supposed to be protected from what? You state "data
>> senders are supposed to encrypt" which, if the recipient doesn't have the
>> decryption key, implies the recipient -- isn't the real recipient but
>> just a transport/storage place until the data is retrieved by the
>> end-user.
> 
> You got it.  I didn't want to explain any more than necessary.  But yes,
> the recipient just stores the data for the end-user.

Trust me. That's not all they are doing.


>> If "you" do the encryption on the storage machine, then you need to
>> also do the decryption when returning the data to the end-user -- which
>> means the key is available somewhere on the storage machine, and the
>> local user might obtain access to it and the stored data.
> 
> Right again.  A legitimate data owner would encrypt the data.  The
> storage machine is encrypting to protect itself against unwanted
> exposure to unencrypted malware.  Not that they would go looking at the
> files, but their virus scanner or file indexer might.

Okay, you're worrying me now. If this is legitimate business, then you
shouldn't be worried about the virus scanner or file indexer *scanning* the
content of the file.

But giving you the benefit of the doubt, that there's nothing underhanded
happening, I don't think you have a good model for the potential threats in
your software. I think there are at least three different threats:

Sender of the data versus the storage machine:

- the sender of the data may deliberately send malware, intending to attack
the people storing the file;

Storage machine versus the end recipient:

- the storage machine may be infected by malware which corrupts the file;

- the owner of the storage machine may deliberately corrupt the data (this
is a special case of the previous);

- the owner of the storage machine may want to spy on the files, that is,
read the contents without changing the files (attack on privacy).


There may be other threats as well, e.g. man-in-the-middle attacks. If this
is anything like BitTorrent, you have a whole range of threats.

But just sticking to the three above, the first one is partially mitigated
by allowing virus scanners to scan the data, but that implies that the
owner of the storage machine can spy on the files. So you have a conflict
here.

Honestly, the *only* real defence against the spying issue is to encrypt the
files. Not obfuscate them with a lousy random substitution cipher. The
storage machine can keep the files as long as they like, just by making a
copy, and spend hours brute-forcing them. They *will* crack the substitution
cipher. In pure Python, that may take a few days or weeks; in C, hours or
days. If they have the resources to throw at it, minutes. Substitution
ciphers have not been effective encryption since, oh, the 1950s, unless you
use a one-time pad. Which you won't be.

That's assuming they don't just look at the Python source code, grab the
cipher key, and decrypt in seconds.
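
To make that concrete, here is a rough sketch of why a byte substitution
cipher leaks so badly. Everything in it is made up for illustration: the
function names, the seed, and the assumption that the attacker knows some
plaintext (say, a predictable file header). It is not Randall's actual
scheme, just a generic 256-entry substitution table.

```python
import random
from collections import Counter

def make_substitution_key(seed):
    """Return a random permutation of the 256 byte values (hypothetical key format)."""
    table = list(range(256))
    random.Random(seed).shuffle(table)
    return bytes(table)

def substitute(data, key):
    """Apply the substitution: bytes.translate maps each byte b to key[b]."""
    return data.translate(key)

def recover_mapping(known_plain, known_cipher):
    """Known-plaintext attack: every aligned byte pair reveals one table entry."""
    return {c: p for p, c in zip(known_plain, known_cipher)}

def partial_decrypt(cipher, inverse, unknown=ord("?")):
    """Decrypt whatever bytes the recovered mapping covers; mark the rest with '?'."""
    return bytes(inverse.get(c, unknown) for c in cipher)

key = make_substitution_key(seed=2015)

# Assumption: the attacker knows (or guesses) this piece of plaintext.
header = b"the quick brown fox jumps over the lazy dog"
secret = b"attack the storage machine at dawn"

cipher_header = substitute(header, key)
cipher_secret = substitute(secret, key)

# The attacker never sees `key`, only the two ciphertexts and the known header.
inverse = recover_mapping(header, cipher_header)
recovered = partial_decrypt(cipher_secret, inverse)

# Byte frequencies survive the substitution untouched, which is what
# makes a ciphertext-only attack feasible as well.
same_histogram = (sorted(Counter(secret).values())
                  == sorted(Counter(cipher_secret).values()))
```

The known header happens to cover every byte value used in the second
message, so the attacker reads it in full without ever seeing the key. With
less known plaintext they would get a partial decryption and fill in the
rest from the frequency counts.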

If you're serious about protecting your users' privacy and their data
integrity, you need to use modern strong encryption, and you need to solve
the issue of how to get the key from the trusted source to the untrusted
storage machine. I have no idea how to do that -- you need to talk to
actual security experts, not random Python programmers.
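
On the integrity half alone, the standard library is enough for a sketch.
Assuming the data owner holds a secret key that the storage machine never
sees (how that key gets distributed is exactly the unsolved problem above),
the owner can tag the blob with an HMAC before upload and check it after
download. The names below are hypothetical, and this does nothing for
privacy: it only detects corruption or tampering.

```python
import hashlib
import hmac
import secrets

def make_tag(key, blob):
    """Compute an HMAC-SHA256 tag over the blob using the owner's secret key."""
    return hmac.new(key, blob, hashlib.sha256).digest()

def verify(key, blob, tag):
    """Return True only if the blob still matches the tag (constant-time compare)."""
    return hmac.compare_digest(make_tag(key, blob), tag)

# Assumption: this key stays with the data owner, never on the storage machine.
owner_key = secrets.token_bytes(32)

blob = b"already-encrypted payload"
tag = make_tag(owner_key, blob)

# Simulate the storage machine (or malware on it) flipping a single bit.
tampered = blob[:-1] + bytes([blob[-1] ^ 0x01])

intact_ok = verify(owner_key, blob, tag)
tampered_ok = verify(owner_key, tampered, tag)
```

A plain checksum stored next to the file would not do the same job, because
whoever can change the file can recompute the checksum. The tag is only
useful because it depends on a key the storage machine does not have.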

A pure Python solution for the encryption is likely to be too slow for more
than toy files. Bite the bullet and use a library written in C. Python uses
C code for all sorts of modules: math, decimal, bisect, pickle, io, etc.
all delegate to C code when available. There's no shame in it.
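
If you want to see that for yourself, something like the following should
do on CPython. The helper name is made up, and the check is only a rough
heuristic: inspect.isbuiltin() reports whether a function is implemented as
a built-in rather than in Python source, and the answers can differ on
other interpreters.

```python
import bisect
import inspect
import math

def is_c_accelerated(func):
    """Heuristic: True if func is a built-in (compiled) function, not Python source."""
    return inspect.isbuiltin(func)

# Which of these end up backed by compiled code on this interpreter?
report = {
    "math.sqrt": is_c_accelerated(math.sqrt),
    "bisect.bisect_left": is_c_accelerated(bisect.bisect_left),
}

for name, accelerated in sorted(report.items()):
    print(name, "->", "C" if accelerated else "pure Python")
```

Modules like bisect ship a pure Python implementation and then try to
replace it with the compiled version at import time, so the same code runs
either way, just faster when the C version is there.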

Not to put too fine a point on it, using a substitution cipher because it's
easy and fast in pure Python code is like making a boat out of styrofoam
because it's light and floats and using aluminium or fibreglass is too
expensive. Sure, that will work for toy applications, like paddling around
the swimming pool in your back yard, but nobody in their right mind would
trust it on the deep ocean or a white-water river.



-- 
Steven



