[Python-checkins] bpo-46504: faster code for trial quotient in x_divrem() (GH-30856)

Mon Jan 24 20:06:10 EST 2022

https://github.com/python/cpython/commit/7c26472d09548905d8c158b26b6a2b12de6cdc32
commit: 7c26472d09548905d8c158b26b6a2b12de6cdc32
branch: main
author: Tim Peters <tim.peters at gmail.com>
committer: tim-one <tim.peters at gmail.com>
date: 2022-01-24T19:06:00-06:00
summary:

bpo-46504: faster code for trial quotient in x_divrem() (GH-30856)

* bpo-46504: faster code for trial quotient in x_divrem()

This brings x_divrem() back into synch with x_divrem1(), which was changed
in bpo-46406 to generate faster code to find machine-word division
quotients and remainders. Modern processors compute both with a single
machine instruction, but convincing C to exploit that requires writing
_less_ "clever" C code.
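
As a rough illustration of the point above, the two ways of getting the
remainder can be put side by side. This is only a sketch: the `digit` and
`twodigits` typedefs below are stand-ins assumed to mirror a 30-bit-digit
build, and the names `qr_pair`, `divrem_mul_sub` and `divrem_plain` are
hypothetical and do not appear in longobject.c. Both functions return the same
values; whether the second compiles to a single divide instruction depends on
the compiler and target.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in types, assumed to resemble a 30-bit-digit CPython build. */
typedef uint32_t digit;
typedef uint64_t twodigits;

/* Hypothetical helper type: a quotient/remainder pair. */
typedef struct {
    digit q;
    digit r;
} qr_pair;

/* Older style: derive the remainder with a multiply and a subtract. */
static qr_pair
divrem_mul_sub(twodigits vv, digit w)
{
    qr_pair out;
    assert(w != 0);
    out.q = (digit)(vv / w);
    out.r = (digit)(vv - (twodigits)w * out.q);
    return out;
}

/* Newer style: write / and % on the same operands and let the
 * compiler notice that one division can serve both. */
static qr_pair
divrem_plain(twodigits vv, digit w)
{
    qr_pair out;
    assert(w != 0);
    out.q = (digit)(vv / w);
    out.r = (digit)(vv % w);
    return out;
}
```

Comparing the generated assembly for the two functions at a given optimization
level is one way to check what a particular compiler actually does.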

files:
M Objects/longobject.c

diff --git a/Objects/longobject.c b/Objects/longobject.c
index ee20e2638bcad..5f0cc579c2cca 100644
--- a/Objects/longobject.c
+++ b/Objects/longobject.c
@@ -2767,8 +2767,15 @@ x_divrem(PyLongObject *v1, PyLongObject *w1, PyLongObject **prem)
         vtop = vk[size_w];
         assert(vtop <= wm1);
         vv = ((twodigits)vtop << PyLong_SHIFT) | vk[size_w-1];
+        /* The code used to compute the remainder via
+         *     r = (digit)(vv - (twodigits)wm1 * q);
+         * and compilers generally generated code to do the * and -.
+         * But modern processors generally compute q and r with a single
+         * instruction, and modern optimizing compilers exploit that if we
+         * _don't_ try to optimize it.
+         */
         q = (digit)(vv / wm1);
-        r = (digit)(vv - (twodigits)wm1 * q); /* r = vv % wm1 */
+        r = (digit)(vv % wm1);
         while ((twodigits)wm2 * q > (((twodigits)r << PyLong_SHIFT)
                                      | vk[size_w-2])) {
             --q;
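
For readers who want to experiment with the trial-quotient step outside of
longobject.c, here is a self-contained sketch. It is not the CPython source:
it assumes 15-bit digits so that a three-digit by two-digit check fits in 64
bits, the names `trial_quotient`, `exact_quotient`, `SHIFT` and `BASE` are
hypothetical, and the part of the loop body after `--q` is not shown in the
hunk above, so it is filled in here following step D3 of Knuth's Algorithm D.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 15-bit digits, chosen so the reference check fits in 64 bits. */
typedef uint16_t digit;
typedef uint32_t twodigits;
#define SHIFT 15
#define BASE ((twodigits)1 << SHIFT)

/* Estimate one quotient digit of (vtop, vmid, vlow) / (wm1, wm2).
 * Start from the two-digit by one-digit quotient, then decrement
 * while the second divisor digit shows the estimate is too large. */
static digit
trial_quotient(digit vtop, digit vmid, digit vlow, digit wm1, digit wm2)
{
    twodigits vv;
    digit q, r;

    assert(wm1 != 0 && vtop <= wm1);
    vv = ((twodigits)vtop << SHIFT) | vmid;
    q = (digit)(vv / wm1);
    r = (digit)(vv % wm1);
    while ((twodigits)wm2 * q > (((twodigits)r << SHIFT) | vlow)) {
        --q;
        r += wm1;          /* keeps r == vv - q * wm1 */
        if (r >= BASE)     /* right-hand side can no longer be exceeded */
            break;
    }
    return q;
}

/* Reference result using plain 64-bit arithmetic. */
static uint64_t
exact_quotient(digit vtop, digit vmid, digit vlow, digit wm1, digit wm2)
{
    uint64_t v = ((uint64_t)vtop << (2 * SHIFT))
                 | ((uint64_t)vmid << SHIFT) | vlow;
    uint64_t w = ((uint64_t)wm1 << SHIFT) | wm2;
    return v / w;
}
```

With a divisor of exactly two digits the corrected estimate matches the exact
quotient; with longer divisors the estimate can still be one too large, which
the rest of the algorithm has to handle.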