[Numpy-discussion] "import numpy" performance

Andrew Dalke dalke at dalkescientific.com
Mon Jul 2 15:17:58 EDT 2012


In this email I propose a few changes which I think are minor
and which don't really affect the external NumPy API, but which
I think could improve "import numpy" performance by at
least 40%. This affects me because I and my clients use a
chemistry toolkit which uses NumPy only for its arrays, and
we often run short programs from the command line.


In July of 2008 I started a thread about how "import numpy"
was noticeably slow for one of my customers. They had
chemical analysis software, often run on a single
molecular structure from command-line tools, and the
roughly 0.1 seconds of import overhead per invocation was
one of the dominant costs even when numpy wasn't needed.

I fixed most of their problems by deferring numpy imports
until needed. I remember well the Steve Jobs anecdote at
  http://folklore.org/StoryView.py?project=Macintosh&story=Saving_Lives.txt
and spent another day of my time in 2008 to identify the
parts of the numpy import sequence which seemed excessive.
I managed to get the import time down from 0.21 seconds to
0.08 seconds.

Very little of that made it into NumPy.


The three biggest changes I would like are:

1) remove "add_newdocs" and put the docstrings in the C code
 ('add_newdocs' itself could stay, but only as a development aid; see below)

The code says:

# This is only meant to add docs to objects defined in C-extension modules.
# The purpose is to allow easier editing of the docstrings without
# requiring a re-compile.

However, the change log shows that there are relatively few commits
to this module:

  Year    Number of commits
  ====    =================
  2012       8
  2011      62
  2010       9
  2009      18
  2008      17

so I propose moving the docstrings to the C code, and perhaps
leaving 'add_newdocs' in place, but used only when drafting and
testing new docstrings.



2) Don't optimistically assume that all submodules are
needed. For example, some current code uses

>>> import numpy
>>> numpy.fft.ifft
<function ifft at 0x10199f578>

(See a real-world example at
  http://stackoverflow.com/questions/10222812/python-numpy-fft-and-inverse-fft
)

IMO, this optimizes for the interactive-shell NumPy user
over the many-fold more people who don't spend their time
in the REPL and/or don't need those extra features loaded
at every NumPy startup. Please bear in mind that NumPy
users of the first category will be active on the mailing
list, go to SciPy conferences, etc., while members of the
second category are much less visible.

I recognize that this is backwards incompatible, and so will
not change now. However, I understand that "NumPy 2.0" is a
glimmer in the future, which might be a natural place for
a transition to the more standard Python style of

  from numpy import fft

Personally, I think the documentation should transition to
this form now (if it doesn't already use it).
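
For what it's worth, the explicit style already works today, and would
keep working if the subpackages stopped being imported eagerly; a
minimal sketch:

import numpy as np
from numpy import fft   # explicit, instead of relying on numpy.fft being pre-imported

a = np.array([1.0, 2.0, 3.0, 4.0])
spectrum = fft.fft(a)            # forward transform
roundtrip = fft.ifft(spectrum)   # inverse transform; roundtrip ~= a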


3) Especially: don't always import 'numpy.testing'

As far as I can tell, automatic import of this module
is not needed, and so is pure overhead for the vast majority
of NumPy users. Unfortunately, there are a large number
of user-facing 'test' and 'bench' bound methods acting
as functions.

from numpy.testing import Tester
test = Tester().test
bench = Tester().bench

They seem rather pointless to me but could be replaced
with per-module functions like

def test(*args, **kwargs):
    from numpy.testing import Tester
    return Tester().test(*args, **kwargs)
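
As far as I can tell, callers wouldn't notice the difference; the call
site stays the same, and numpy.testing (plus unittest) only gets
imported if the test suite is actually run:

import numpy
numpy.test()    # numpy.testing is imported here, not at "import numpy" time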




I have not worried about numpy import performance for
4 years. While I have been developing scientific software
for 20 years, and in Python for 15 years, it has been
in areas of biology and chemistry which don't use arrays.
I use numpy for a day about once every two years, and
so far I have had no reason to use scipy.


This has changed.

I talked with one of my clients last week. They (and I)
use a chemistry toolkit called "RDKit". RDKit uses
numpy as a way to store coordinate data for molecules.
I checked with the package author and he confirms:

  yeah, it's just using the homogenous array most of the time.

My client complained about RDKit's high startup cost,
due to the NumPy dependency. On my laptop, with a warm
disk cache, it takes 0.119s to "import rdkit". On a cold
cache it can take 3 seconds. On their cluster filesystem,
with a cold cache, it can take over 10 seconds.

(I told them about zipimport. They will be looking into
that as a solution. However, it doesn't easily help the other
people who use the RDKit toolkit.)
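
For reference, the zipimport approach is roughly the following; the zip
path here is made up, and compiled extension modules (.so/.pyd) still
have to live outside the archive:

import sys

# put the pure-Python .py/.pyc files of the packages into one zip archive;
# Python's built-in zipimport hook then loads them from it automatically
sys.path.insert(0, '/path/to/deps.zip')   # hypothetical path

from rdkit import Chem   # the Python parts now come from a single large file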


With instrumentation I found that 0.083s of the 0.119s
is spent loading numpy.core.multiarray. The slowest module
import times are listed here, with the cumulative time for
each module and the name of the (first) importing parent
in parentheses:

0.119 rdkit
0.089 rdchem (pyPgSQL)
0.083 numpy.core.multiarray (rdchem)
0.038 add_newdocs (numpy.core.multiarray)
0.032 numpy.lib (add_newdocs)
0.023 type_check (numpy.lib)
0.023 numpy.core.numeric (type_check)
0.012 numpy.testing (numpy.core.numeric)
0.010 unittest (numpy.testing)
0.008 cDataStructs (pyPgSQL)
0.007 random (numpy.core.multiarray)
0.007 mtrand (random)
0.006 case (unittest)
0.005 rdmolfiles (pyPgSQL)
0.005 rdmolops (pyPgSQL)
0.005 difflib (case)
0.005 chebyshev (numpy.core.multiarray)
0.004 hermite (numpy.core.multiarray)
0.004 hermite_e (numpy.core.multiarray)
0.004 laguerre (numpy.core.multiarray)

These timings were measured on a MacBook Pro I bought this year, using
<module 'numpy' from '/Library/Python/2.7/site-packages/numpy-1.6.1-py2.7-macosx-10.7-intel.egg/numpy/__init__.pyc'>
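
The instrumentation itself is nothing special. A rough sketch of the
idea (not the exact code I used) is to wrap __import__ and record
cumulative times per module name:

import __builtin__
import time

_real_import = __builtin__.__import__
_times = {}

def _timed_import(name, *args, **kwargs):
    t0 = time.time()
    try:
        return _real_import(name, *args, **kwargs)
    finally:
        # cumulative: a parent's time includes the time of whatever it imports
        _times[name] = _times.get(name, 0.0) + (time.time() - t0)

__builtin__.__import__ = _timed_import

import numpy    # or: from rdkit import Chem

for dt, name in sorted(((t, n) for n, t in _times.items()), reverse=True)[:10]:
    print "%.3f %s" % (dt, name)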



The minimal cheminformatics program is

% time python -c 'from rdkit import Chem; print Chem.MolToSmiles(Chem.MolFromSmiles("OCC"))'
CCO
0.126u 0.035s 0:00.16 93.7%	0+0k 0+0io 0pf+0w

(This chemical structure doesn't contain coordinates,
so numpy is pure overhead. However, other formats do
contain 2D or 3D coordinates. None need hermite or
other polynomials, which together are 10% of the
wall-clock time.)

With a hot disk cache, half of the time is spent importing.
With a cold cache, it's worse. (E.g., trying again just now, it
takes 1.955 seconds to "import rdkit". Trying again much later,
it takes 1.17 seconds to "import numpy"; my web browser
windows and tabs have filled most of memory.)

Real code is of course more complex than this trivial
example. On the other hand, the typical development
cycle is to write code that works on one compound, get
that working, and then run it on thousands of compounds.
The typical algorithm runs in under 1 second, so during
early development it's obvious that much of the run time
is dominated by import startup.

My hope is to get the single-compound time down to
under 0.1 seconds, or about 40% faster than it is now.
Below the 0.1 second threshold, human factors studies
show that people consider the time to be "instantaneous."

I do not think I'll be able to shave off the full 0.06s
which I want, but I can get close. If I can figure out
how to get rid of add_newdocs (0.038s) and numpy.testing
(0.012s) then I'll end up removing 0.05s. If I can remove
automatic inclusion of the polynomial modules then I'm
well into my goal.
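
And nothing would be lost for people who do need the polynomial
classes; they could still get them with an explicit import, e.g.
(using the modules as they exist in 1.6):

from numpy.polynomial import chebyshev

c = chebyshev.Chebyshev([1.0, 2.0, 3.0])   # 1*T_0 + 2*T_1 + 3*T_2
print c(0.5)                               # evaluate the series at x = 0.5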

What can be done to make these changes to NumPy? What
are the objections to my providing an updated set of
patches removing add_newdocs and numpy.testing?

Cheers,


				Andrew
				dalke at dalkescientific.com




