[pypy-dev] LLVM next steps

Tue Sep 10 19:53:29 CEST 2013

On Sun, Sep 8, 2013 at 11:17 AM, Armin Rigo <arigo at tunes.org> wrote:
> Hi again,
>
> On Sun, Sep 8, 2013 at 9:42 AM, Armin Rigo <arigo at tunes.org> wrote:
>> We've been suitably impressed by the results on the new llvm backend
>> during the sprint (well, or suitably un-impressed by both gcc and
>> clang's failure to reconstruct the SSA meaning of the C code).
>
> I have investigated a bit more and it's quite unclear that this would
> be the source of the difference.  It seems that the "-flto" option of
> gcc, enabling link-time optimization, actually gives very good
> improvements over the same compilation without this option --- some
> 11-14%, more so than, say, the typical 5% reported with CPython.  If I
> had to guess, I'd say it is because of the particularly disorganized
> kind of C code produced by RPython.
>
> About the llvm backend, one detail hints that it might be the reason
> for the speed improvement: the fact that the current llvm backend
> produces most of the source code in a single file.  This may be what
> gives llvm extra room for improvements.  This is precisely the same
> room for improvement that "-flto" also gives gcc, considering that we
> generate many C files with never-"static" functions.
>
> I tried to compile a no-jit version of PyPy from the
> llvm-translation-backend branch, for comparison, but this fails right
> now with "NotImplementedError: v585190 = debug_offset()".  It
> successfully compiles targetrpystonedalone (in -O2 mode), though.  I
> get the following results (with the argument "100000000"):
>
>     plain gcc 4.7.3:  1.95 seconds
>     llvm 3.3:  1.75 seconds
>     gcc with -flto:  1.66 seconds
>
> If we get similar results on the whole PyPy, then I fear the llvm
> backend is going back to where it already went to several times: "not
> useful enough".  We can simply add the -flto flag to the generated
> Makefiles.  Manuel, do you feel like trying to compare?  I'm modifying
> the Makefile manually as follows:
>
>     CFLAGS = ......  -flto -fno-fat-lto-objects
>     LDFLAGS = .....  -flto=8 -O3

The type of machine-generated code produced by PyPy is difficult for
compilers to optimize (lots of seemingly unstructured gotos, state
machines, unusual basic block heuristics) when presented in a
high-level language like C.  The distribution of the source code
across a large number of source files also complicates the
optimization process.

GCC and LLVM link-time optimization can overcome some of these
problems by allowing the compiler to "see" more of the program and
optimize across the source files.  Directly generating LLVM IR
achieves a similar benefit.  With some of the recent changes to
GCC, one could also generate GCC IR directly.

LLVM makes it very convenient to directly input the IR and take
advantage of optimization opportunities allowed by such an input
method, but the performance benefit is not likely due to other
differences in the optimization pipelines or code generation
capabilities.

In addition to the GCC -flto option, you should consider whether
-fwhole-program is also appropriate (I believe that it is).

GCC has additional optimizations that can help with the style of code
generated by programs like PyPy.  PyPy does not generate code with
computed gotos, but its aggressive use of gotos differs from normal
user-written code and can probably benefit from non-default compiler
optimization heuristics.  There is no obvious recommendation, but
experiments with enabling / disabling some forms of GCSE (-fgcse,
-fgcse-lm, -fgcse-sm, -fgcse-las, -fgcse-after-reload) as well as
tuning some of the parameters (crossjumping, goto-duplication,
inlining limits) might benefit PyPy.
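
As a rough sketch of such an experiment, in the same style as the
Makefile edit quoted above: the EXPERIMENT variable is a made-up name,
and the choice of flags and the numeric values are arbitrary
placeholders for illustration, not measured or recommended settings.
With -flto the same options probably need to be given at link time as
well, since that is where most of the optimization happens.

    # hypothetical starting point; change one thing at a time and re-time
    EXPERIMENT = -fgcse-sm -fgcse-las -fno-crossjumping \
                 --param max-goto-duplication-insns=16 \
                 --param max-inline-insns-auto=60
    CFLAGS += $(EXPERIMENT)
    LDFLAGS += $(EXPERIMENT)

Keeping the options in one variable makes it easy to drop or swap a
single flag between runs and to keep compile and link steps in sync.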

One can achieve performance gains with either compiler through
adjustments to the generated code and the compiler optimization
heuristics.

Thanks, David