[Python-Dev] Optimization targets (was: String hash function multiplier)

Mike Pall mikepy-0404 at mike.de
Wed Apr 14 17:49:45 EDT 2004


Hi,

Raymond wrote:
> It looks like the best bet is to try to speedup the code without
> changing the multiplier.

Indeed. And while you are at it, there are other optimizations that seem
more promising:

I compiled a recent CVS Python with profiling enabled; here is a list of the
top CPU hogs (on a Pentium III, your mileage may vary):

pystone:

  CPU%  Function Name
----------------------------
 55.44  eval_frame
  7.30  lookdict_string
  4.34  PyFrame_New
  3.73  frame_dealloc
  1.73  vgetargs1
  1.65  PyDict_SetItem
  1.42  string_richcompare
  1.15  PyObject_GC_UnTrack
  1.11  PyObject_RichCompare
  1.08  PyInt_FromLong
  1.08  tupledealloc
  1.04  insertdict

parrotbench:

  CPU%  Function Name
----------------------------
 23.65  eval_frame
  8.68  l_divmod
  4.43  lookdict_string
  2.95  k_mul
  2.27  PyType_IsSubtype
  2.23  PyObject_Malloc
  2.09  x_add
  2.05  PyObject_Free
  2.05  tupledealloc

Arguably parrotbench is a bit unrepresentative here. And beware: due to the
automatic inlining of static functions, the real offender may be hidden
(x_divmod is the hog, not l_divmod).

Anyway, this just confirms that the most important optimization targets are:
1. eval_frame
2. string keyed dictionaries
3. frame handling

I think 3. needs an algorithmic approach (there was some discussion about
it a few weeks ago), while 1. and 2. look like a code generation issue.

So I took a look at the machine code that GCC generates on x86 with -O3
for these: uh oh, not a pretty sight. The main problems are bad branch
predictions and lack of inlining.

About branch prediction:

The worst offender is the main code path for eval_frame: it gets split
across half a dozen segments. The code path for the non-argument bytecodes
is in fact the slowest.

Similarly the code for lookdict_string branches around like crazy. The
most likely code path is not the fastest path.

One solution is to use likely()/unlikely() macros (based on __builtin_expect).
This is a good fit for small, isolated, performance-critical code. It does
not work well for eval_frame, though (I tried).
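
For reference, the usual definitions look something like this (a minimal
sketch; these macros are not in the Python tree, and the GCC version check
is just the common convention):

  /* Wrap GCC's __builtin_expect; compile to plain conditions elsewhere. */
  #if defined(__GNUC__) && __GNUC__ >= 3
  #define likely(x)   __builtin_expect(!!(x), 1)
  #define unlikely(x) __builtin_expect(!!(x), 0)
  #else
  #define likely(x)   (x)
  #define unlikely(x) (x)
  #endif

  /* Typical use: mark the error path as cold so GCC lays out the
     common path as straight-line code. */
  if (unlikely(key == NULL))
      return NULL;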

But GCC has more to offer: read the man page entries for -fprofile-arcs
and -fbranch-probabilities. Here is a short recipe:

Go to your Python source directory and do this:

$ mkdir build_profile
$ cd build_profile
$ ../configure         # add any options you may need
$ make
$ mv python python_orig

Edit the Makefile and add -fprofile-arcs to OPT.

$ rm Python/ceval.o
$ make
$ mv python python_profile

Run your favourite benchmark(s), but only invoke python *once*:

$ ./python_profile -c 'import test.pystone; test.pystone.main(loops=100000)'

Forget about the performance numbers the benchmark reports. Never use
an executable compiled with profiling for comparison. But ... you should
now have a new (binary) file called Python/ceval.da that contains the
profiled branch probabilities.

Edit the Makefile and replace -fprofile-arcs with -fbranch-probabilities.

$ rm Python/ceval.o
$ make
$ mv python python_opt

Then compare the benchmarks:

$ ./python_orig -c 'import test.pystone; test.pystone.main(loops=100000)'
$ ./python_opt -c 'import test.pystone; test.pystone.main(loops=100000)'

On my machine I get a 10-15% speedup. But we only optimized ceval.c ...

So repeat the above steps, but now delete Python/ceval.o and Objects/*.o
each time. Now I get about 20-25% speedup for pystone and 10% speedup for
parrotbench! Not bad, huh?

Now the bad news: I don't know how to integrate this into the regular
build process. So this is not an option for everyone, but ambitious
packagers might want to take the trouble to do this by hand.

Oh, and of course neither pystone nor parrotbench is representative of the
branch probabilities of any particular application. But for predicting
eval_frame and lookdict_string, they are probably good enough.

On a related note: GCC uses random branch probabilities with -O when no
probability information is present (no __builtin_expect or *.da files).
Use -fno-guess-branch-probability if you want predictable timings across
recompiles.


About inlining:

- The opcode decoding with NEXTOP/HAS_ARG/NEXTARG does not compile well.
  GCC also has trouble inferring the lifetimes of opcode and oparg.
  Ideas:
  - Use a special inline-assembly variant for x86.
  OR
  - Move NEXTARG into each opcode that needs it and make oparg a
    block-local variable.
  - Use NEXTOP directly for the switch and refetch opcode into a
    block-local variable where necessary.
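
  A rough sketch of the non-assembly variant (simplified and incomplete;
  this is not the real ceval.c code, though NEXTOP/NEXTARG/GETLOCAL are
  its existing macros):

    for (;;) {
        switch (NEXTOP()) {

        case POP_TOP:               /* argument-less: no NEXTARG() at all */
            /* ... */
            continue;

        case LOAD_FAST: {
            int oparg = NEXTARG();  /* block-local, lives only here */
            /* ... push GETLOCAL(oparg) ... */
            continue;
        }

        /* ... */
        }
    }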

- The Py_Ticker check and the tracing support should be moved out of the
  core loop. The Py_Ticker check may be put into a subroutine and called
  from selected opcodes only (haven't checked if that works). I have no
  (good) idea about how to get rid of the tracing support, though.
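
  A rough sketch of what such a subroutine might look like (assuming the
  _Py_Ticker/_Py_CheckInterval globals from ceval.c; the function name is
  made up):

    /* Called only from opcodes that can loop or block (backward jumps,
       calls), instead of testing the ticker before every opcode. */
    static int
    check_periodic_things(void)
    {
        if (--_Py_Ticker >= 0)
            return 0;                   /* fast path: nothing to do */
        _Py_Ticker = _Py_CheckInterval;
        /* pending calls, signal handlers, thread switch, ... */
        return Py_MakePendingCalls();
    }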

- _PyString_Eq() is a good candidate for inlining in lookdict_string(),
  i.e. a new macro in stringobject.h. But wait ...

- GCC generates pretty slow code for _PyString_Eq() (at least on x86).
  This is mainly due to a bad branch prediction (we should use likely()),
  but also due to the penalty for the 8-bit partial-register load/compare
  generated by the ob_sval dereferences. The latter is completely useless
  because memcmp() is inlined (with repz cmpsb) and duplicates that
  comparison. Omitting the *ob_sval comparison is faster.
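
  Putting the two points together, the inlined check could look roughly
  like this (hypothetical macro name; assumes <string.h> for memcmp and
  that both arguments are known to be string objects, which holds inside
  lookdict_string):

    #define STRING_KEYS_EQ(a, b) \
        (((PyStringObject *)(a))->ob_size == \
             ((PyStringObject *)(b))->ob_size && \
         memcmp(((PyStringObject *)(a))->ob_sval, \
                ((PyStringObject *)(b))->ob_sval, \
                ((PyStringObject *)(a))->ob_size) == 0)

  Note that the *ob_sval pre-check is gone, for the reason given above.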

- Create a macro that inlines the first part of PyObject_Hash().
  Useful for PyDict_SetItem() and others.
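
  Something along these lines, perhaps (a sketch; the macro name is made
  up, and o is evaluated more than once, so only pass simple lvalues):

    /* Fast path: nearly every key type has a tp_hash slot, so call it
       directly and fall back to the full function for the rest. */
    #define PyObject_HASH(o) \
        ((o)->ob_type->tp_hash != NULL \
            ? (*(o)->ob_type->tp_hash)(o) \
            : PyObject_Hash(o))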

I bet there are more, but I'm running out of time right now -- sorry.

Bye,
     Mike


