Multiprocessing performance question

george trojan george.trojan at gmail.com
Wed Feb 20 19:15:40 EST 2019


import itertools
from shapely import geometry

def create_box(x_y):
    return geometry.box(x_y[0] - 1, x_y[1],  x_y[0], x_y[1] - 1)

x_range = range(1, 1001)
y_range = range(1, 801)
x_y_range = list(itertools.product(x_range, y_range))

grid = list(map(create_box, x_y_range))

This creates and populates an 800x1000 “grid” (represented as a flat
list at this point) of “boxes”, where each box is a
shapely.geometry.box(). It takes about 10 seconds to run.

Looking at this, I am thinking it would lend itself well to
parallelization. Since the box at each “coordinate” is independent of all
others, it seems I should be able to simply split the list up into chunks
and process each chunk in parallel on a separate core. To that end, I
created a multiprocessing pool:

pool = multiprocessing.Pool()

And then called pool.map() rather than just “map”. Somewhat to my surprise,
the execution time was virtually identical. Given the simplicity of my
code, and the presumable ease with which it should be able to be
parallelized, what could explain why the performance did not improve at all
when moving from the single-process map() to the multiprocess map()?

I am aware that in Python 3 the map() function is lazy and doesn’t
actually produce results until they are consumed, which is why I
wrapped everything in calls to list(), at least for testing.
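For example, nothing actually runs until the map object is consumed
(a quick illustration, using the definitions above):

lazy = map(create_box, x_y_range)  # instant: no boxes built yet
first = next(lazy)                 # builds exactly one box
rest = list(lazy)                  # builds the remaining 799,999 boxes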


The reason multiprocessing does not speed things up is the overhead of
pickling/unpickling objects. Here are results on my machine, running
Jupyter notebook:

import itertools
import multiprocessing
from datetime import datetime
from shapely import geometry

def create_box(xy):
    return geometry.box(xy[0]-1, xy[1], xy[0], xy[1]-1)

nx = 1000
ny = 800
xrange = range(1, nx+1)
yrange = range(1, ny+1)
xyrange = list(itertools.product(xrange, yrange))

%%time
grid1 = list(map(create_box, xyrange))

CPU times: user 9.88 s, sys: 2.09 s, total: 12 s
Wall time: 10 s

%%time
pool = multiprocessing.Pool()
grid2 = list(pool.map(create_box, xyrange))

CPU times: user 8.48 s, sys: 1.39 s, total: 9.87 s
Wall time: 10.6 s
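(For the record, pool.map() already splits the input into chunks
internally; its chunksize argument can be tuned, as in the sketch
below, but that does not remove the cost of shipping the results back.)

import os

# one large chunk per worker; pool.map() picks a similar default itself
chunk = len(xyrange) // os.cpu_count()
grid2 = list(pool.map(create_box, xyrange, chunksize=chunk))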

Results are exactly like yours. To see what is going on, I rolled my
own chunking, which allowed me to add some print statements.

%%time
def myfun(chunk):
    # worker: build the boxes for one chunk, then report completion
    g = list(map(create_box, chunk))
    print('chunk', chunk[0], datetime.now().isoformat())
    return g

pool = multiprocessing.Pool()
# ten chunks, each covering 100 values of x (i.e. 100*ny grid points)
chunks = [xyrange[i:i+100*ny] for i in range(0, nx*ny, 100*ny)]
print('starting', datetime.now().isoformat())
gridlist = list(pool.map(myfun, chunks))
grid3 = list(itertools.chain(*gridlist))
print('done', datetime.now().isoformat())

starting 2019-02-20T23:03:50.883180
chunk (1, 1) 2019-02-20T23:03:51.674046
chunk (701, 1) 2019-02-20T23:03:51.748765
chunk (201, 1) 2019-02-20T23:03:51.772458
chunk (401, 1) 2019-02-20T23:03:51.798917
chunk (601, 1) 2019-02-20T23:03:51.805113
chunk (501, 1) 2019-02-20T23:03:51.807163
chunk (301, 1) 2019-02-20T23:03:51.818911
chunk (801, 1) 2019-02-20T23:03:51.974715
chunk (101, 1) 2019-02-20T23:03:52.086421
chunk (901, 1) 2019-02-20T23:03:52.692573
done 2019-02-20T23:04:02.477317
CPU times: user 8.4 s, sys: 1.7 s, total: 10.1 s
Wall time: 12.9 s

All ten subprocesses finished within 2 seconds. It took about 10 more
seconds to get back and assemble the partial results. The objects have
to be pickled, sent back through a pipe, and unpickled. Unpickling is
done by the main (i.e. single) process, and it takes almost as long as
creating the objects from scratch. Essentially the main process does
the following:

%%time
def f(b):
    # what pickle.loads() does per object: allocate a bare instance
    # of the class, then restore its state
    g1 = b[0].__new__(b[0])
    g1.__setstate__(b[2])
    return g1

buf = [g.__reduce__() for g in grid1]  # roughly what pickling produces
grid4 = [f(b) for b in buf]

CPU times: user 20 s, sys: 411 ms, total: 20.4 s
Wall time: 20.3 s

The first line roughly corresponds to pickling (not exactly, as real
pickled data is a single bytes object, not a list of tuples). The
second line is what pickle.loads() does.
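
The same cost shows up if you time an actual pickle round trip on the
grid (timings will of course vary):

import pickle

%time buf = pickle.dumps(grid1)   # packing: normally done in the workers
%time grid5 = pickle.loads(buf)   # unpacking: done by the main process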

I do not think numpy will help here. The Python function box() still
has to be called 800k times, and that will take time. np.vectorize(),
as the documentation states, is provided only for convenience; it is
essentially implemented as a for loop. IMO real vectorization would
have to be done at the C level.
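
To illustrate, an np.vectorize() version might look like the sketch
below; it still makes one Python-level call per grid point, so no
speedup should be expected:

import numpy as np

# convenience only: np.vectorize() calls the function once per element
vbox = np.vectorize(lambda x, y: geometry.box(x - 1, y, x, y - 1),
                    otypes=[object])
xs, ys = np.meshgrid(xrange, yrange, indexing='ij')
grid6 = list(vbox(xs, ys).ravel())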

Greetings from Anchorage

George


