File transfer over network - with Pyro?

Dan Stromberg strombrg at gmail.com
Sat Jun 5 13:14:13 EDT 2010


On Thu, 03 Jun 2010 20:05:15 +0000, exarkun wrote:

> On 06:58 pm, strombrg at gmail.com wrote:
>>On Jun 3, 10:47 am, Nathan Huesken <pyt... at lonely-star.org> wrote:
>>>Hi,
>>>
>>>I am writing a network application which needs to do file transfers
>>>from time to time (I am writing the server as well as the client).
>>>For simple network messages, I use Pyro because it is very
>>>comfortable. But I suspect that doing a file transfer over Pyro is
>>>very inefficient - am I right (the files are pretty big)?
>>>
>>>I somehow need to ensure that the client requesting a file transfer
>>>is the same client getting the file, so some sort of authentication
>>>is needed.
>>>
>>>What library would you use to do the file transfer?
>>>
>>>Regards,
>>>Nathan
>>
>>I've never used Pyro, but for a fast network file transfer in Python,
>>I'd probably use the socket module directly, with a cache-oblivious
>>algorithm:
>>   http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
>>
>>I recently did a Python progress meter application that uses a
>>cache-oblivious algorithm; it works on files rather than sockets, but
>>it can get over 5 gigabits/second of throughput (that's without the
>>network in the picture, though on 10 Gig-E with a suitable transport
>>it could probably do nearly that) on a nearly-modern, 2-core PC
>>running Ubuntu.  It's at:
>>   http://stromberg.dnsalias.org/~strombrg/gprog/ .
> 
> This seems needlessly complicated.  Do you have a hard drive that can
> deliver 5 gigabits/second to your application?  More than likely not.

Most such programs aren't optimized well for even one machine, let 
alone able to adapt to the cache-related specifics of nearly any 
transfer - so the tool you're using to measure performance becomes the 
bottleneck itself.  I wouldn't want an oral thermometer that gave the 
patient a temporarily higher fever, and it'd be nice not to have to 
retune the thermometer for each patient, too.

Besides, it's a _conceptually_ simple algorithm: keep the n best-
performing block sizes, and for subsequent writes pick the one that has 
historically done best, occasionally trying a different, random block 
size even when the current one is working well.  It's actually 
something I learned about as an undergrad from a favorite professor, 
who was a little insistent that hard-coding a "good" block size for the 
specifics of a single machine was short-sighted when you care about 
performance, since code almost always moves to a different machine (or 
a different disk, or a different network peer) eventually.  Come to 
think of it, she taught two of my three algorithms classes.  Naturally, 
she also said that you shouldn't tune for performance unnecessarily.
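
A minimal sketch of that idea in Python (my illustration here, not 
gprog's actual code - it keeps every observed rate rather than only the 
n best, and the candidate sizes and 10% exploration rate are arbitrary):

    import random

    class BlocksizeTuner(object):
        """Track throughput per block size; usually reuse the
        historical best, but once in a while try a random size."""

        CANDIDATES = [2 ** n for n in range(12, 25)]  # 4 KiB .. 16 MiB

        def __init__(self, explore=0.1):
            self.explore = explore  # fraction of picks that experiment
            self.rates = {}         # block size -> best bytes/sec seen

        def pick(self):
            # Explore at random until we have data - and occasionally
            # thereafter, even if the current best is doing fine.
            if not self.rates or random.random() < self.explore:
                return random.choice(self.CANDIDATES)
            return max(self.rates, key=self.rates.get)

        def record(self, blocksize, nbytes, seconds):
            rate = nbytes / float(seconds)
            if rate > self.rates.get(blocksize, 0.0):
                self.rates[blocksize] = rate

You'd time each read or write using the size pick() returns, then feed 
the measurement back through record().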

> A more realistic answer is probably to use something based on HTTP. This
> solves a number of real-world problems, like the exact protocol to use
> over the network, and detecting network issues which cause the transfer
> to fail.  It also has the benefit that there's plenty of libraries
> already written to help you out.

Didn't the OP request something fast?  HTTP code tends to be 
"optimized" for small transfers (if it's optimized at all), since most 
of the web is small files.

OP: I should mention: if you're on gigabit or better, you should 
probably talk with your sysadmin about enabling Jumbo Frames and Path 
MTU Discovery - otherwise even a cache-oblivious algorithm likely won't 
help much, because the CPU would get pegged too early.  If, on the 
other hand, you only care about 10BaseT (or perhaps even 100BaseT) 
speeds, HTTP would probably be fine - a typical CPU today can keep up 
with that - especially if you're doing a single transfer at a time.
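
And if you do need to push gigabit speeds, the direct-socket approach I 
mentioned above isn't much code.  A rough sketch of the sending side 
(again my illustration, with a fixed 1-megabyte block size where you'd 
really let something like the tuner above choose):

    import socket

    BLOCKSIZE = 1024 * 1024  # placeholder; in practice, adapt it

    def send_file(path, host, port):
        """Stream a file to (host, port) in large blocks."""
        sock = socket.create_connection((host, port))
        try:
            infile = open(path, 'rb')
            try:
                while True:
                    block = infile.read(BLOCKSIZE)
                    if not block:
                        break
                    sock.sendall(block)
            finally:
                infile.close()
        finally:
            sock.close()

The receiver just loops on recv() and writes until recv() returns an 
empty string; the authentication the OP wants would have to be layered 
on top, e.g. by wrapping the socket with the ssl module.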
