[issue45116] Performance regression 3.10b1 and later on Windows: Py_DECREF() not inlined in PGO build

neonene report at bugs.python.org
Mon Sep 13 19:37:47 EDT 2021


neonene <nicesalmon at gmail.com> added the comment:

With MSVC 16.10.3 and 16.11.2 (the latest),
PR25244 showed me that the amount of code in _PyEval_EvalFrameDefault() exceeds the PGO limit.
In the old version of _PyEval_EvalFrameDefault() (b98eba5), the same issue can be triggered by adding any code, anywhere in the function, that contains more than 20 expressions/statements. For example, at the top, middle, or end of the function, repeating "if (0) {}" 10 times, or adding "if (0) {19 statements}". For Python 3.9.7, the threshold is more than 800 expressions/statements.

Here is just a workaround for 3.10rc2 on Windows.
==================================================
--- Python/ceval.c
+++ Python/ceval.c
@@ -1306,9 +1306 @@
-#define DISPATCH() \
-    { \
-        if (trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE) { \
-            goto tracing_dispatch; \
-        } \
-        f->f_lasti = INSTR_OFFSET(); \
-        NEXTOPARG(); \
-        DISPATCH_GOTO(); \
-    }
+#define DISPATCH() goto tracing_dispatch
@@ -1782,4 +1774,9 @@
     tracing_dispatch:
     {
+        if (!(trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE)) {
+            f->f_lasti = INSTR_OFFSET();
+            NEXTOPARG();
+            DISPATCH_GOTO();
+        }
         int instr_prev = f->f_lasti;
         f->f_lasti = INSTR_OFFSET();
==================================================

This patch becomes ineffective if even one expression is added to the DISPATCH macro, as below:

   #define DISPATCH() {if (1) goto tracing_dispatch;}

This approach is also not sufficient for 3.11, whose eval function is bigger.
I don't know of a cl/link option that lifts this restriction on function size.
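
To make the shape of the workaround patch above easier to see outside of ceval.c, here is a minimal sketch on a toy stack machine. Everything in it (toy_eval, the OP_* opcodes, the tracing and traced parameters) is hypothetical and is not CPython code. It only illustrates the idea of reducing DISPATCH() to a bare goto and keeping the tracing check and the instruction fetch once, under the shared label, so that each handler contributes fewer expressions to the function. It makes no claim about how MSVC counts them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine (not CPython's). */
enum { OP_PUSH, OP_ADD, OP_HALT };

/* Every handler ends with a bare goto; no per-site check or fetch. */
#define DISPATCH() goto tracing_dispatch

/* Evaluates `code` and returns the value on top of the stack.
   `tracing` and `traced` are stand-ins for the tracing machinery. */
int toy_eval(const int *code, int tracing, int *traced)
{
    int stack[16];
    size_t sp = 0;
    size_t pc = 0;
    int opcode;

tracing_dispatch:
    /* The check lives in one place instead of in every DISPATCH(). */
    if (tracing) {
        (*traced)++;            /* stand-in for the slow tracing path */
    }
    opcode = code[pc++];        /* fetch the next instruction */

    switch (opcode) {
    case OP_PUSH:
        assert(sp < 16);
        stack[sp++] = code[pc++];
        DISPATCH();
    case OP_ADD:
        assert(sp >= 2);
        stack[sp - 2] += stack[sp - 1];
        sp--;
        DISPATCH();
    case OP_HALT:
    default:
        break;
    }
    return sp > 0 ? stack[sp - 1] : 0;
}
```

Calling toy_eval() on the program { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT } evaluates 2 + 3. With tracing enabled, the counter is incremented once per instruction fetched.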


3.10rc2 x86 pgo : 1.00
        patched : 1.09x faster (slower  5, faster 48, not significant 5)

3.10rc2 x64 pgo : 1.00         (roughly the same speed as official bin)
        patched : 1.07x faster (slower  5, faster 47, not significant 6)
  patched(/Ob3) : 1.07x faster (slower  7, faster 45, not significant 6)

The x64 results are attached.

Fixing the inlining rejection also makes it possible to build with __forceinline using normal processing time and memory usage.

----------
Added file: https://bugs.python.org/file50280/310rc2_benchmarks.txt

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue45116>
_______________________________________

