python in parallel for pattern discovery in genome data

Andrew Dalke adalke at mindspring.com
Wed Jul 30 22:14:23 EDT 2003


BalyanM:
> I am interested to run python on a sun machine(SunE420R,os=solaris)
> with 4 cpu's for a pattern discovery/search program on biological
> sequence(genomic sequence).I want to write the python code so that it
> utilizes all the 4 cpu's.

  *oomphh*

There are a lot of details buried in your lines.

It looks like you will be writing your own pattern matching code.
Why?  There are plenty of tools for that already.  A quick web
search finds http://genome.imb-jena.de/seqanal.html and many
of those tools are freely available.

Okay, suppose you do have the tool or library for it.  Do you
want to do high throughput searches?  Then you can just break
your N jobs into 4 batches of about N/4 jobs each, one per CPU.
The easiest way in Python is to run 4 Python programs, each with
a little server going (see the SimpleXMLRPCServer module for an
example) and have your code
call them (see Aahz's excellent example of master/slave
programming using threads).  Other options for the communications
are Twisted and Pyro.
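
As a rough sketch of the "little server" idea, here is what one
worker might look like using the standard library's XML-RPC support.
It uses the Python 3 module name xmlrpc.server, which is where
SimpleXMLRPCServer lives now.  The names find_pattern and
make_worker, the regular-expression search, and the port number are
all assumptions made for illustration, not part of any existing tool.

```python
import re
from xmlrpc.server import SimpleXMLRPCServer


def find_pattern(pattern, sequence):
    """Return the start offsets of every match, overlapping ones included.

    Hypothetical search function; a real one would call whatever
    pattern-matching tool or library you settled on.
    """
    # A zero-width lookahead lets overlapping hits be reported too.
    regex = re.compile("(?=(?:%s))" % pattern)
    return [match.start() for match in regex.finditer(sequence)]


def make_worker(host="127.0.0.1", port=8000):
    """Build (but do not start) an XML-RPC server exposing find_pattern."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(find_pattern)
    return server
```

Calling make_worker(port=8001).serve_forever() in each of four
separate Python processes (ports 8001 to 8004, say) would give one
worker per CPU.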

You will not be able to do this with one Python process because
Python has what's called the "global interpreter lock" that
prevents core Python from effectively using multiple processors.
You can write a C extension which does the search and gives
up the lock, but you seem to want to do this in raw Python.

(The suggestion to look at POSH won't work - it has some
Intel-specific assembly instructions in the C extension.)

Depending on the type of pattern search, you can instead assign
1/4 of the genome to each process, with overlap if needed.  This
will speed up a single search, which is good for interactivity.
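
A minimal sketch of the overlap idea, assuming the pattern is a
fixed string (a motif) of known length.  Each piece is extended by
len(motif) - 1 characters into the next, so a match that straddles
a boundary is still seen whole by exactly one piece.  The function
names are hypothetical, and the pieces are searched one after another
here; in practice each piece would go to a different process.

```python
def split_with_overlap(sequence, parts, overlap):
    """Yield (offset, chunk) pairs covering sequence in about `parts` pieces."""
    # Ceiling division, with a floor of 1 so an empty sequence is harmless.
    size = max(1, -(-len(sequence) // parts))
    for start in range(0, len(sequence), size):
        yield start, sequence[start:start + size + overlap]


def search_chunks(sequence, motif, parts=4):
    """Find every occurrence of motif by searching each chunk separately."""
    overlap = len(motif) - 1
    hits = []
    for offset, chunk in split_with_overlap(sequence, parts, overlap):
        pos = chunk.find(motif)
        while pos != -1:
            # Convert the chunk-relative position back to a genome offset.
            hits.append(offset + pos)
            pos = chunk.find(motif, pos + 1)
    return hits
```

With this much overlap no match is reported twice, because a match
can only fit inside a chunk if it starts in that chunk's own
(non-overlap) region.  Patterns of variable length, such as general
regular expressions, need a larger overlap and explicit removal of
duplicates.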

These approaches work for a single "user" of the code.  Might you have
many people trying to do pattern searches?  If so, you may
need some way to throttle how many searches are done per
machine.  For in-house use this likely isn't a problem - besides,
you should get your code working first.
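
If throttling ever does become necessary, one simple possibility is
a semaphore that caps the number of searches running at once.  This
is only a sketch; the Throttle class and the choice to reject extra
requests (instead of making them wait) are assumptions.

```python
import threading


class Throttle:
    """Allow at most `limit` calls to run at the same time."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def run(self, func, *args):
        # Refuse immediately instead of queueing when every slot is taken.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("too many searches running, try again later")
        try:
            return func(*args)
        finally:
            self._slots.release()
```

A server would create one Throttle and route every incoming search
through its run method.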

There are other approaches.  You could use shared memory or
CORBA for the communications, or PVM or MPI.  Still, given
your experience, you should:
  1) get your algorithm working on one machine
  2) get it working as a client/server using XML-RPC (see the
       SimpleXMLRPCServer and xmlrpclib modules),
  3) get your client to work with multiple servers,
          using multiple threads in the client
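
Step 3 could be sketched as follows: one client thread per server,
each pulling jobs from a shared queue until none are left.  The
threads spend their time waiting on the network, so the global
interpreter lock does not stop the servers from working in parallel.
This uses the Python 3 name xmlrpc.client for what was xmlrpclib.
The function run_jobs, the default remote method name find_pattern,
and the connect hook are hypothetical names chosen for this example.

```python
import queue
import threading
from xmlrpc.client import ServerProxy


def run_jobs(worker_urls, jobs, method_name="find_pattern", connect=ServerProxy):
    """Send each job (a tuple of arguments) to whichever worker is free.

    Returns the results in the same order as `jobs`.  `connect` builds a
    proxy from a URL; it is a hook so another transport can be swapped in.
    """
    pending = queue.Queue()
    for item in enumerate(jobs):
        pending.put(item)
    results = [None] * len(jobs)

    def drain(url):
        proxy = connect(url)  # one proxy per thread, so nothing is shared
        remote_call = getattr(proxy, method_name)
        while True:
            try:
                index, args = pending.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            results[index] = remote_call(*args)

    threads = [threading.Thread(target=drain, args=(url,)) for url in worker_urls]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

For example, run_jobs(["http://localhost:8001", "http://localhost:8002"],
[("GATTACA", seq1), ("GATTACA", seq2)]) would search two sequences
on two workers at once.  There is no error handling: if a call
fails, that thread stops and its result stays None.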

(It's a bit of my experience too - I really should try Pyro
for this sort of work.  Well, I need a break so maybe I'll
try it out tonight ;)

There are a lot of skills to learn before it all works, so don't
get too discouraged too quickly.

                    Andrew
                    dalke at dalkescientific.com





