[Numpy-discussion] parallel distutils extensions build? use gcc -flto

Julian Taylor jtaylor.debian at googlemail.com
Wed Jul 16 18:20:40 EDT 2014


hi,
I have been playing around a bit with gcc's link time optimization
(LTO) feature and found that using it actually speeds up a from-scratch
build of numpy, due to its ability to perform optimization and linking
in parallel. As a bonus you should also get faster binaries, thanks to
the better optimizations LTO allows.

As compiling with LTO requires some possibly lesser-known details, I
wanted to share them.

Prerequisites are a working gcc toolchain of at least gcc 4.8 and
binutils > 2.21; gcc 4.9 is better as it is faster.
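
A quick way to check whether your toolchain qualifies (plain shell,
nothing numpy-specific):

gcc --version     # needs >= 4.8, 4.9 preferred
ld --version      # binutils > 2.21 for linker plugin support
gcc-ar --version  # gcc-ar must be present for LTO static libraries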

First of all, numpy checks the long double representation by compiling
a file and inspecting the resulting binary. This won't work here, as
the od -b reimplementation used for that check does not understand LTO
objects, so on x86 we must short-circuit it:
--- a/numpy/core/setup_common.py
+++ b/numpy/core/setup_common.py
@@ -174,6 +174,7 @@ def check_long_double_representation(cmd):
     # We need to use _compile because we need the object filename
     src, object = cmd._compile(body, None, None, 'c')
     try:
+        return 'IEEE_DOUBLE_LE'  # x86 only: skip the od-based check, it can't parse LTO objects
         type = long_double_representation(pyod(object))
         return type
     finally:


Next we build numpy as usual, but override the compiler, linker and ar
to add our custom flags.
The setup.py call looks like this:

CC='gcc -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -O3' \
LDSHARED='gcc -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -shared -O3' \
AR=gcc-ar \
python setup.py build_ext

Some explanation:
The AR override is needed because numpy builds a static library, and ar
needs to know about LTO objects; gcc-ar does exactly that.
-flto=4, the main flag, tells gcc to perform link time optimization
using 4 parallel processes.
-fno-fat-lto-objects tells gcc to build only LTO objects; normally it
builds both an LTO object and a normal object for toolchain
compatibility. If our toolchain can handle LTO objects this is just a
waste of time, so we skip it. (The flag is the default in gcc 4.9 but
not in 4.8.)
-fuse-linker-plugin directs gcc to run its link time optimizer plugin
in the linking step. The linker must support plugins; both bfd (> 2.21)
and the gold linker do. This allows for more optimizations.
-O3 has to be passed to the linker too, as that is where the
optimization actually occurs. In general, a pitfall of LTO is that the
compiler options of all compilation steps must match the flags used for
linking.
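
If you want to verify that the toolchain really produces and consumes
slim LTO objects, you can inspect the object file's sections, since LTO
bytecode lives in .gnu.lto_* sections. An illustrative check (foo.c is
a stand-in for any source file):

gcc -fno-fat-lto-objects -flto -O3 -c foo.c -o foo.o
readelf -S foo.o | grep -i 'gnu.lto'  # slim LTO objects carry .gnu.lto_* sections
gcc-nm foo.o                          # the gcc-* wrappers load the LTO plugin; plain nm may show no symbols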

If you are using C++ or gfortran you also have to override those
compilers to use LTO as well (CXX, and for Fortran presumably F77/F90,
though the exact variable names depend on your distutils); see the
sketch below.
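
A sketch of a full override covering all three languages; the F77/F90
variable names are an assumption, so check what your numpy.distutils
actually honors:

CC='gcc -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -O3' \
CXX='g++ -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -O3' \
F77='gfortran -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -O3' \
F90='gfortran -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -O3' \
LDSHARED='gcc -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -shared -O3' \
AR=gcc-ar \
python setup.py build_ext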

See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html for a lot
more details.


For some numbers: on my machine a from-scratch numpy build with no
caching takes 1min55s; with -flto=4 it takes only 55s. Pretty neat for
a much more involved optimization process.
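
For reference, these are uncached from-scratch timings, i.e. something
along these lines (a sketch; make sure ccache is out of the picture):

rm -rf build                    # force a from-scratch build
time python setup.py build_ext  # baseline, stock toolchain
rm -rf build
time env CC='gcc -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -O3' \
    LDSHARED='gcc -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -shared -O3' \
    AR=gcc-ar python setup.py build_ext  # LTO build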

Concerning the speed gain of the resulting binaries: I ran our
benchmark suite with this build and there were no really significant
gains, which is somewhat expected, as numpy is simple C code with most
function bottlenecks already inlined.

So, in conclusion: -flto seems to work well with recent gccs and allows
for faster builds using the limited distutils. While probably not
useful for development, where compiler caching (ccache) is of utmost
importance, it is still interesting for projects that do one-shot
uncached builds (Travis-like CI), have huge objects (e.g. generated by
swig or cython), and don't want to change to a proper parallel build
system like bento.

PS: As far as I know clang also supports LTO, but I have never used it.
PPS: Using NPY_SEPARATE_COMPILATION=0 crashes gcc-4.9; time for a bug
report.

Cheers,
Julian


