[Baypiggies] clustering

Thu Aug 31 00:47:11 CEST 2006

Shannon -jj Behrens wrote:
> Hey Guys,
>
> I need to do some data processing, and I'd like to use a cluster so
> that I don't have to grow old waiting for my computer to finish.  I'm
> thinking about using the servers I have locally.  I'm completely new
> to clustering.  I understand how to break a problem up into
> parallelizable pieces, but I don't understand the admin side of it.  My
> current data set is about 16 gigs, and I need to do things like run
> filters over strings, make sure strings are unique, etc.  I'll be
> using Python wherever possible.
>   
It would help to have more detail about your dataset and the processing
that needs to be done.  For example, you are running operations on each
string, and the results ultimately have to be collected back at the
master node.  Is it as simple as partitioning your 16 gig dataset equally
over all the nodes, running the map-like operations on the cluster nodes,
then running the reduce-like operation as a communication between the
master and the cluster nodes?  Or will there need to be multiple
reductions and redistributions of data?
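
As a rough sketch of the simple case, assuming the work really is "filter
each string, then keep the unique ones": the names below (partition,
map_step, reduce_step) are made up for illustration, and everything runs
in one process just to show the shape of the data flow.  Hashing each
string to pick its node means duplicates always land on the same node, so
the final reduction is only a union of disjoint sets.

```python
import zlib

def partition(strings, n_nodes):
    """Assign each string to a node by hashing it, so equal strings share a node."""
    buckets = [[] for _ in range(n_nodes)]
    for s in strings:
        node = zlib.crc32(s.encode("utf-8")) % n_nodes
        buckets[node].append(s)
    return buckets

def map_step(chunk, keep):
    """Work done on one node: apply the filter, then deduplicate locally."""
    return {s for s in chunk if keep(s)}

def reduce_step(partials):
    """Work done on the master: merge the per-node sets."""
    merged = set()
    for partial in partials:
        merged |= partial
    return merged

# Toy data standing in for the real dataset; the filter here drops empty strings.
data = ["apple", "pear", "apple", "", "fig", "pear"]
buckets = partition(data, 3)
partials = [map_step(bucket, keep=bool) for bucket in buckets]
unique = reduce_step(partials)
print(sorted(unique))
```

If the real job needs several rounds of reduction and redistribution,
this one-pass shape will not be enough.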
> * Do I have to run a particular Linux distro?  Do they all have to be
> the same, or can I just setup a daemon on each machine?
>
>   
You can have a heterogeneous cluster; you will just have to do a bit more
work, depending on how much the machines vary.

> * What does "Beowulf" do for me?
>
>   
There is no product called "Beowulf"; it's more a description of a
collection of technologies strapped together to make cheap supercomputers.
It's worth reading about because the people who work on Beowulf clusters
have typically done much of your homework for you.

> * How do I admin all the boxes without having to enter the same command n times?
>
>   
I use cfengine: you define rules, which can include commands to be run,
files to be copied, etc.  You can then sit on an admin box and push
updates to all the nodes.

> * I've heard that MPI is good and standard.  Should I use it?  Can I
> use it with Python programs?
>
>   
You can (see MPI Python and PyPar), but I never thought it was a good
idea.  MPI is about extracting that last bit of efficiency out of
supercomputers for tasks in which the parallelism has to be very tightly
coupled.  Writing good MPI code is hard; administering MPI clusters is
painful.

Your time is better spent writing a lightweight parallelism interface
using the existing lightweight networking code in Python, or in an add-on
package like Pyro or Twisted.
> * Is there anything better than NFS that I could use to access the data?
>   
Disks are cheap; fragment your dataset and put chunks on each one.  Heck,
you could even put a web server on an admin node, put the data there, and
have your clients request parts of the data as necessary.
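
One way the fragmenting could look, as a minimal sketch: it assumes the
data is a single file with one string per line, and the helper names
(chunk_ranges, read_chunk) are hypothetical.  It computes byte ranges that
end on line boundaries, so each node can be handed a chunk without any
string being cut in half.  The same ranges could be used by clients asking
a web server for parts of the file.

```python
import os
import tempfile

def chunk_ranges(path, n_chunks):
    """Return (start, end) byte offsets of roughly equal chunks ending on newlines."""
    size = os.path.getsize(path)
    ranges = []
    start = 0
    with open(path, "rb") as f:
        for i in range(1, n_chunks + 1):
            if start >= size:
                break  # file exhausted before all chunks were needed
            target = size * i // n_chunks
            if i == n_chunks or target >= size:
                end = size
            else:
                f.seek(max(target, start))
                f.readline()  # move forward to the end of the current line
                end = f.tell()
            if end > start:
                ranges.append((start, end))
                start = end
    return ranges

def read_chunk(path, start, end):
    """Read the bytes of one chunk."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)

# Demo on a small temporary file standing in for the real dataset.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"a\nbb\nccc\ndddd\n")
    demo_path = tmp.name

ranges = chunk_ranges(demo_path, 3)
chunks = [read_chunk(demo_path, start, end) for start, end in ranges]
os.remove(demo_path)
print(ranges)
```

With very short files you can get fewer chunks than you asked for, since a
chunk never ends in the middle of a line.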


> * What's hip, slick, and cool these days?
>
>   
Err... well, web and grid services get a lot of attention, but you need
to make a big up-front investment in infrastructure and design before
you see any benefits.  I wouldn't really call them hip, slick, or cool.

> I just need you to point me in the right direction and tell me what's
> good and what's a waste of time.
>   
I think you should look at the Python documentation for XMLRPC and/or
Pyro.  Build an ultra-simple XMLRPC server that runs on all the cluster
machines and allows you to upload Python code fragments and execute them.
Build another XMLRPC server that runs on the admin machine.  When the
cluster machines finish their jobs, just have them upload their results
to the admin machine for final reduction.
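
A minimal sketch of that arrangement, under some loud assumptions: it uses
the Python 3 standard library modules xmlrpc.server and xmlrpc.client (the
Python 2 equivalents are SimpleXMLRPCServer and xmlrpclib), both servers
run on localhost in background threads so the example is self-contained,
and the function names (run_fragment, submit_result) are invented for
illustration.  Executing uploaded code is only sane on a private network
where you trust every caller.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

def serve_in_background(functions, host="127.0.0.1", port=0):
    """Start an XML-RPC server exposing `functions` on a daemon thread."""
    server = SimpleXMLRPCServer((host, port), logRequests=False, allow_none=True)
    for fn in functions:
        server.register_function(fn)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def url_of(server):
    """Build the URL for a server started above (port 0 picks a free port)."""
    host, port = server.server_address
    return "http://%s:%d" % (host, port)

# --- admin machine: collects results for the final reduction ---
collected = []

def submit_result(node_id, payload):
    collected.append((node_id, payload))
    return True

# --- cluster machine: runs uploaded code, then reports to the admin ---
def run_fragment(source, func_name, args, node_id, admin_url):
    namespace = {}
    exec(source, namespace)  # runs arbitrary code: trusted callers only
    result = namespace[func_name](*args)
    xmlrpc.client.ServerProxy(admin_url).submit_result(node_id, result)
    return True

admin = serve_in_background([submit_result])
worker = serve_in_background([run_fragment])

# The code fragment to upload: drop empty strings and deduplicate.
source = (
    "def unique_nonempty(strings):\n"
    "    return sorted(set(s for s in strings if s))\n"
)
proxy = xmlrpc.client.ServerProxy(url_of(worker))
proxy.run_fragment(source, "unique_nonempty", [["b", "a", "", "b"]],
                   "node-1", url_of(admin))

for server in (worker, admin):
    server.shutdown()
    server.server_close()
print(collected)
```

On a real cluster each worker would run as its own process on its own
machine, and the admin box would call every worker's run_fragment with a
different chunk of the data.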

Dave