[Numpy-discussion] Followup on Python+MPI import performance

Asher Langton langton2 at llnl.gov
Thu Mar 15 13:38:14 EDT 2012


On Mon, Mar 5, 2012 at 10:17 AM, Asher Langton <langton at gmail.com> wrote:
> This is a followup to my post from January
> (http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html)
> and the panel discussion at PyData this weekend. As a few people have
> suggested, a better approach than the MPI-broadcasted lookups is to
> cache the locations of all the modules found in sys.path.
> [...]
> I'll put an initial implementation of this importer on github sometime
> this week, and I'll follow up this post with some performance numbers
> when I have them.

Here are some numbers for the PEP 302-based cached importer on an IBM
BlueGene/P machine. Times are wall-clock measurements from the time
utility, in minutes:seconds. Each is a single run (not an average),
with no attempt to account for other activity on the system or
fileservers. (That said, I ran a variety of other tests, and the
results were consistent.) I still need to run some larger tests,
particularly in the 16k-64k range, where Python imports start to scale
very poorly on this machine.

The tests use the code currently at github.com/langton/MPI_Import with
a script that simply imports 100 small C-extension modules.

With 1k cores/MPI processes:
cached_import.finder: 14:19.98
- skip actual import [1]: 13:37.77
- with checks [2]: 27:09.60
- w/checks, no import: 26:23.63

cached_import.mpi4py_finder [3]: 2:32.51
- skip actual import: 1:42.55
- with checks: 2:32.38
- w/checks, no import: 1:42.94

MPI_Import [4]: 2:22.20

standard import: 15:43.63
- skip actual imports [5]: 0:56.59


With 4k cores/MPI processes:
cached_import.finder: 27:34.45
- skip actual import: 27:40.58
- with checks: 52:14.83
- w/checks, no import: 50:04.73

cached_import.mpi4py_finder: 4:03.02
- skip actual import: 3:12.75
- with checks: 4:13.65
- w/checks, no import: 3:18.46

MPI_Import: 4:02.76

standard import: 35:24.77
- skip actual imports: 1:56.36
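
To read the headline timings above as ratios, the minutes:seconds
values can be converted to seconds and divided into the standard-import
time. The snippet below is only a convenience sketch; the helper names
(to_seconds, speedups) are made up for illustration, and the numbers
are copied from the tables above.

```python
def to_seconds(stamp):
    """Convert a 'minutes:seconds' string such as '15:43.63' to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


# Full-run timings copied from the tables above (no sub-cases).
timings = {
    "1k": {
        "standard import": "15:43.63",
        "cached_import.finder": "14:19.98",
        "cached_import.mpi4py_finder": "2:32.51",
        "MPI_Import": "2:22.20",
    },
    "4k": {
        "standard import": "35:24.77",
        "cached_import.finder": "27:34.45",
        "cached_import.mpi4py_finder": "4:03.02",
        "MPI_Import": "4:02.76",
    },
}


def speedups(results):
    """Return each importer's speedup relative to the standard import."""
    base = to_seconds(results["standard import"])
    return {name: base / to_seconds(stamp) for name, stamp in results.items()}


if __name__ == "__main__":
    for cores, results in timings.items():
        for name, ratio in speedups(results).items():
            print(f"{cores:>2} {name:<28} {ratio:4.1f}x")
```

By this measure the mpi4py-based finder and MPI_Import are roughly 6x
faster than the standard import at 1k processes and close to 9x faster
at 4k, while the non-MPI cached finder stays under 1.3x in both cases.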

Notes:
[1] Builds the cache, but omits the actual imports.
[2] Checks whether modules in sys.path are readable while building the
cache. Because filesystem operations are expensive, these checks are
off by default.
[3] Only the rank 0 process builds the initial cache, which is then
broadcast over MPI.
[4] The other import replacement: the earlier MPI_Import module, which
uses the MPI-broadcast lookups described in my January post.
[5] This is roughly the interpreter startup/initialization time.
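
For readers who want a concrete picture of notes [1]-[3], here is a
minimal sketch of the general idea, written against the modern
importlib API rather than the original PEP 302 hooks. It is not the
code from the repository: build_cache, CachedFinder and
build_cache_on_root are hypothetical names, and the comm argument is
assumed to be an mpi4py-style communicator providing Get_rank() and
bcast().

```python
import importlib.abc
import importlib.machinery
import os


def build_cache(paths, check_readable=False):
    """Map top-level module names to the first directory that provides them."""
    suffixes = tuple(importlib.machinery.all_suffixes())
    cache = {}
    for directory in paths:
        directory = directory or os.getcwd()  # '' on sys.path means the cwd
        try:
            entries = os.listdir(directory)  # one listing per path entry
        except OSError:
            continue  # skip missing or non-directory path entries
        for entry in entries:
            if entry.endswith(suffixes):
                name = entry.split(".", 1)[0]
            elif "." not in entry:
                name = entry  # possibly a package directory
            else:
                continue
            if check_readable and not os.access(
                os.path.join(directory, entry), os.R_OK
            ):
                continue  # optional check: one extra filesystem call per entry
            cache.setdefault(name, directory)  # first hit wins, as on sys.path
    return cache


class CachedFinder(importlib.abc.MetaPathFinder):
    """Meta path finder that searches only the cached directory for a module."""

    def __init__(self, cache):
        self.cache = cache

    def find_spec(self, fullname, path=None, target=None):
        if path is not None:
            return None  # submodule import: leave it to the default machinery
        directory = self.cache.get(fullname)
        if directory is None:
            return None  # not cached: fall through to the normal search
        # Search a single directory instead of every entry on sys.path.
        # A None result here also falls through to the normal search.
        return importlib.machinery.PathFinder.find_spec(fullname, [directory])


def build_cache_on_root(comm, paths):
    """Build the cache on rank 0 only, then broadcast it to all ranks."""
    cache = build_cache(paths) if comm.Get_rank() == 0 else None
    return comm.bcast(cache, root=0)
```

To try it in a single process, something like
sys.meta_path.insert(0, CachedFinder(build_cache(sys.path))) would
install the finder ahead of the default ones. With mpi4py,
MPI.COMM_WORLD could be passed as comm so that only rank 0 touches the
filesystem while building the cache.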


-Asher


