Pickle based workflow - looking for advice

Fabien fabien.maussion at gmail.com
Mon Apr 13 10:58:01 EDT 2015


Folks,

I am writing a fairly extensive piece of scientific software. Its 
workflow is easy to explain: the tool performs a series of operations 
on watersheds (such as mapping data onto them, geostatistics and 
more). There are thousands of independent watersheds of varying size, 
and the size determines the computing time spent on each of them.

Say I have the operations A, B, C and D. B and C are completely 
independent, but both need A to run first; D needs B and C, and so 
forth. Eventually the whole chain A, B, C, D will run just once over 
all watersheds, but of course development is an iterative process and 
I rerun individual operations many times.
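
For a single watershed the dependency order boils down to something 
like this (a hypothetical serial driver; C and D would be defined 
along the same lines as A and B below):

	def run_all(watershed_dir):
		A(watershed_dir)  # must run first
		B(watershed_dir)  # needs A
		C(watershed_dir)  # needs A, independent of B
		D(watershed_dir)  # needs both B and C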

Currently the workflow is set up as follows. Define a unique ID and 
working directory for each watershed, then define A and B:

import os
import pickle

def A(watershed_dir):
	# read some external data
	# do stuff
	# store the results in a Watershed object
	watershed = Watershed()
	# save it to the watershed's directory
	f_pickle = os.path.join(watershed_dir, 'watershed.p')
	with open(f_pickle, 'wb') as f:
		pickle.dump(watershed, f)

def B(watershed_dir):
	# load the Watershed object written by A
	f_pickle = os.path.join(watershed_dir, 'watershed.p')
	with open(f_pickle, 'rb') as f:
		watershed = pickle.load(f)
	# do new stuff
	# store it in watershed and save
	with open(f_pickle, 'wb') as f:
		pickle.dump(watershed, f)
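
(For reference, Watershed itself is currently little more than a 
plain container; a minimal sketch, with made-up attribute names:)

	class Watershed(object):
		"""Data container; A, B, C and D fill in its attributes."""
		def __init__(self, ws_id=None):
			self.ws_id = ws_id   # unique watershed ID
			self.grid = None     # e.g. filled by A (hypothetical)
			self.stats = None    # e.g. filled by B (hypothetical)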

So the Watershed object is a data container whose content grows as 
the chain proceeds. The pickle that stores it can reach a few MB in 
size. I chose this strategy because A, B, C and D are independent but 
can share their results through the pickle. Each function takes a 
single argument (the path to the working directory), which means that 
when I run the thousands of catchments I can use a multiprocessing 
pool:

	import multiprocessing as mp
	poolargs = [list of directories]
	pool = mp.Pool()
	poolout = pool.map(A, poolargs, chunksize=1)
	poolout = pool.map(B, poolargs, chunksize=1)
	etc.
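
(Note that on platforms where multiprocessing spawns fresh 
interpreters instead of forking, Windows notably, the pool creation 
has to live under a main guard, roughly:)

	if __name__ == '__main__':
		pool = mp.Pool()
		poolout = pool.map(A, poolargs, chunksize=1)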

I can easily choose to rerun just B without rerunning A. Reading and 
writing the pickles is negligible in comparison to the rest of the 
work (running B or C on a single catchment can take seconds, for 
example).
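
If the pickle I/O ever does become noticeable, one cheap lever (a 
suggestion, not something I have profiled) is to pass an explicit 
protocol when dumping; on Python 2 the default is the slow ASCII 
protocol 0, and the binary protocols are faster and more compact:

	with open(f_pickle, 'wb') as f:
		pickle.dump(watershed, f, protocol=pickle.HIGHEST_PROTOCOL)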

Now, to my questions:
1. Does that seem reasonable?
2. Should Watershed be an object, or would a simple dictionary do? I 
thought an object could be nice because it could take care of 
operations such as plotting and logging. Currently I have defined a 
class Watershed, but its attributes are defined and filled by A, B 
and C (this seems a bit wrong to me). I could give more 
responsibilities to the class, but it might become far too big: since 
the whole purpose of the tool is to work on watersheds, a Watershed 
class that does everything sounds like a code smell 
(http://en.wikipedia.org/wiki/God_object)
3. The operation A opens an external file, reads data out of it and 
writes it to the Watershed object. Is it a bad idea to multiprocess 
this? (I guess it is, since several processes might read the file at 
the same time; see the sketch after these questions.)
4. Other comments you might have?
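
Regarding question 3: if concurrent access does turn out to be a 
problem, the workaround I have in mind (just a sketch, with 
hypothetical names init_worker/io_lock) would be to serialize the 
file access with a lock handed to each worker at pool start-up:

	import multiprocessing as mp

	def init_worker(lock):
		# stash the lock in a global so A can see it in each worker
		global io_lock
		io_lock = lock

	def A(watershed_dir):
		with io_lock:
			# read the external file here, one process at a time
			pass
		# ... rest of A unchanged ...

	if __name__ == '__main__':
		io_lock = mp.Lock()
		pool = mp.Pool(initializer=init_worker, initargs=(io_lock,))
		poolout = pool.map(A, poolargs, chunksize=1)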

Sorry for the lengthy mail but thanks for any tip.

Fabien
