[Baypiggies] clustering
Carl J. Van Arsdall
cvanarsdall at mvista.com
Tue Sep 5 19:22:30 CEST 2006
Shannon -jj Behrens wrote:
> Hey Guys,
>
> I need to do some data processing, and I'd like to use a cluster so
> that I don't have to grow old waiting for my computer to finish. I'm
> thinking about using the servers I have locally. I'm completely new
> to clustering. I understand how to break a problem up into
> parallelizable pieces, but I don't understand the admin side of it. My
> current data set is about 16 gigs, and I need to do things like run
> filters over strings, make sure strings are unique, etc. I'll be
> using Python wherever possible.
>
> * Do I have to run a particular Linux distro? Do they all have to be
> the same, or can I just setup a daemon on each machine?
>
From what I've seen, this can vary. For example, if you are using PVM
then you should be able to run a heterogeneous cluster without too much
difficulty. Personally, though, for ease of administration I prefer to
keep things (at least on the software side) as similar as I can. The
cluster is what you make of it.
> * What does "Beowulf" do for me?
>
Beowulf isn't so great. There are a number of "active" clustering
technologies going on. I've seen a bit about OpenMosix passed around,
although I believe it exists as kernel patches that were somewhat dated
the last time I checked (they were for 2.4 kernels). If you have a lot
of machines, you might even want to google load balancing clusters and
see what you get.
> * How do I admin all the boxes without having to enter the same command n times?
>
Check out dsh, the dancer's shell. If you are running a Debian-based
distro you can just apt-get it. I use it all the time; it's a really
handy tool.
> * I've heard that MPI is good and standard. Should I use it? Can I
> use it with Python programs?
>
As far as parallel programs go, MPI (and sometimes PVM) tends to be the
best way to achieve maximum speed, although it tends to incur more
development overhead. Lots of people also combine MPI with OpenMP (or
pthreads; OpenMP is nice and easy and soon to be standard in gcc) when
they have clusters of SMP machines. In my experience, when you have
lots of data to move around it can definitely be to your advantage to
use MPI, as you can control specifically how data will be passed around
and set up a network to match that. With 16 gigs of data you will
really want to look at your network topology and how you choose to
distribute the data.
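
To make the "how you choose to distribute the data" point concrete,
here is a rough sketch (plain Python, not MPI) of one way to split a
set of strings for the uniqueness check you mentioned: hash each string
to pick a node, so duplicates always land on the same node and each
node can de-duplicate on its own. The function names and the node count
are made up for illustration, and this says nothing about how the data
actually gets shipped over the network.

```python
import hashlib
from collections import defaultdict


def node_for(s, n_nodes):
    """Pick a node index for a string using a hash that is stable across runs."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes


def partition(strings, n_nodes):
    """Group strings by target node; equal strings always share a bucket."""
    buckets = defaultdict(list)
    for s in strings:
        buckets[node_for(s, n_nodes)].append(s)
    return buckets


def unique_per_node(buckets):
    """What each node would do locally: drop duplicates in its own bucket."""
    return {node: sorted(set(items)) for node, items in buckets.items()}


if __name__ == "__main__":
    data = ["spam", "eggs", "spam", "ham", "eggs"]  # stand-in for the real data
    buckets = partition(data, n_nodes=3)            # 3 nodes is an assumption
    for node, items in sorted(unique_per_node(buckets).items()):
        print(node, items)
```

Because equal strings hash to the same node, the per-node results can
simply be concatenated at the end; no node ever needs to see another
node's data to decide whether a string is unique.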
> * Is there anything better than NFS that I could use to access the data?
>
I've seen a number of different ways to do this. You can google
distributed shared file systems; I think there are a couple of projects
out there, although I've never used any of them and I'd be very much
interested in anyone's stories if they have any.
> * What's hip, slick, and cool these days?
>
You might even check out some grid computing stuff, kinda neat imho.
Also, once you get a cluster up and running with MPI or whatever, you
might want to go as far as profiling your code to find the serious
bottlenecks in your application. Check out TAU (Tuning and Analysis
Utilities); it has Python bindings as well as MPI/OpenMP support. Not
that you will use it, that's just one of those things you can google
should you be bored at work or interested in that type of stuff, and
it's a good way to justify to your employer why you need to install
InfiniBand as your network ;)
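
If you just want a first look at where the time goes on a single box
before reaching for TAU, Python's built-in cProfile is enough. This is
only a sketch, and it is cProfile, not TAU; the filter_strings workload
and the profile_call helper are invented for the example.

```python
import cProfile
import io
import pstats


def filter_strings(strings):
    """Stand-in workload: keep unique, non-empty, stripped strings."""
    seen = set()
    kept = []
    for s in strings:
        s = s.strip()
        if s and s not in seen:
            seen.add(s)
            kept.append(s)
    return kept


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and show the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


if __name__ == "__main__":
    data = [" spam ", "eggs", "spam", "", "ham"] * 1000
    result, report = profile_call(filter_strings, data)
    print(len(result))
    print(report)
```

This only covers one process on one machine; it tells you nothing
about time spent in communication between nodes.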
> I just need you to point me in the right direction and tell me what's
> good and what's a waste of time.
>
Well, as you know, you probably want to avoid Python threads, although
I've set up a fairly primitive distributed system with Python threads
and ssh. Everything is I/O bound for me, so it works really well,
although I'm looking into better distributed technologies. Just more
stuff to play with as we learn (and I'm reading all the links people
have posted in response to your questions too, lots of good stuff)! I'd
also be interested in the solution you choose, so if you ever want to
post a follow-up thread I'd be happy to read the results of your
project!
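
For the curious, the threads-plus-ssh idea looks roughly like the
sketch below. This is not my actual system, just the general shape: one
thread per host, each blocked on an ssh subprocess, which is why being
I/O bound makes threads a reasonable fit here. The function names and
host names are hypothetical, and it assumes key-based ssh logins are
already set up on every box.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_command(host, command):
    """Build the ssh invocation for one host."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, command]


def run_on_host(host, command, runner=subprocess.run):
    """Run command on host; return (host, exit code, captured stdout)."""
    result = runner(ssh_command(host, command), capture_output=True, text=True)
    return host, result.returncode, result.stdout


def run_everywhere(hosts, command, runner=subprocess.run, max_workers=8):
    """Run the same command on every host in parallel, one thread per call."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as hosts.
        return list(pool.map(lambda h: run_on_host(h, command, runner), hosts))
```

Something like run_everywhere(["node1", "node2"], "uptime") would then
return one (host, exit code, output) tuple per machine. The runner
argument is only there so the ssh call can be swapped out for testing.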
-carl
--
Carl J. Van Arsdall
cvanarsdall at mvista.com
Build and Release
MontaVista Software