[issue39801] list.insert is slow, likely due to manual memmove

Raymond Hettinger report at bugs.python.org
Sat Feb 29 13:11:27 EST 2020


Raymond Hettinger <raymond.hettinger at gmail.com> added the comment:

Here's a diff you can try out:

diff --git a/Objects/listobject.c b/Objects/listobject.c
index 3c39c6444b..ac6c9cc07a 100644
--- a/Objects/listobject.c
+++ b/Objects/listobject.c
@@ -272,8 +272,7 @@ ins1(PyListObject *self, Py_ssize_t where, PyObject *v)
     if (where > n)
         where = n;
     items = self->ob_item;
-    for (i = n; --i >= where; )
-        items[i+1] = items[i];
+    memmove(items+where+1, items+where+0, (n - where) * sizeof(PyObject *));
     Py_INCREF(v);
     items[where] = v;
     return 0;

For me, that patch makes little difference in the timings.

Another thing that could be tried is to have ins1() call list_ass_slice() to do the work.  That way, both paths would use the same underlying insertion code.  It would be interesting to see whether this makes any difference.
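
As a rough check, a timeit harness along these lines (a sketch only; the list size and loop counts are arbitrary) compares the two user-level spellings:

import timeit

setup = "a = list(range(1000))"
# Note: the list grows by `number` elements within each repeat, so
# realloc() behaviour is part of what gets measured (see the caveats below).
print(timeit.repeat("a.insert(0, 0)", setup=setup, number=10_000, repeat=5))
print(timeit.repeat("a[0:0] = [0]", setup=setup, number=10_000, repeat=5))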

A few other thoughts:

* We usually prefer deques to lists when prepending: deque.appendleft() is O(1), while list.insert(0, x) is O(n).  (A small sketch follows the disassembly below.)

* Modern compilers are very good at optimizing data movements in a simple for-loop.

* It is very difficult to get good timings for these kinds of operations.  Internally, both code paths make a call to list_resize(), which in turn uses realloc().  The latter function can be very fast or very slow depending solely on whether it can resize in place or has to move the data.  Accordingly, timings are easily confounded by minor changes to memory usage.  To get a clear-cut timing, you would need to isolate just the for-loop or memmove() operation and not include Python calling overhead, realloc() operations, and everything else that is going on in the given timings.

* Another source of confounding is the compiler itself.  Changes that make small improvements on Clang may hurt the GCC build, or vice-versa.  The comparison can also be thrown off by whether you're timing a PGO build (which tends to do an amazing job of optimizing for-loops).  The results may also differ between 32-bit and 64-bit builds.

* The timing isn't representative of how people use insert(0, x).  It would be a mistake to optimize an uncommon case if it comes at the expense of the common case (insertions into small lists).  To guard against this, you would need to measure inserts into a small list to find out whether memmove() has more setup time than the for-loop (see the small-list sketch after the disassembly below).  When we've done such investigations in the past, for-loops were found to beat memmove() for sizes under 16.  Almost certainly that threshold has shifted as compilers and processors have changed over the years.

* For smaller list sizes, the calling overhead dominates the insertion time.  Disassembling the two spellings shows a good deal of per-call overhead.

>>> from dis import dis
>>> dis('a.insert(0,0)')
  1           0 LOAD_NAME                0 (a)
              2 LOAD_METHOD              1 (insert)
              4 LOAD_CONST               0 (0)
              6 LOAD_CONST               0 (0)
              8 CALL_METHOD              2
             10 RETURN_VALUE
>>> dis('a[0:0]=[0]')
  1           0 LOAD_CONST               0 (0)
              2 BUILD_LIST               1
              4 LOAD_NAME                0 (a)
              6 LOAD_CONST               0 (0)
              8 LOAD_CONST               0 (0)
             10 BUILD_SLICE              2
             12 STORE_SUBSCR
             14 LOAD_CONST               1 (None)
             16 RETURN_VALUE
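
As mentioned in the deque bullet above, a minimal sketch of the preferred prepending pattern:

from collections import deque

d = deque([1, 2, 3])
d.appendleft(0)        # O(1): no existing elements are shifted
# d is now deque([0, 1, 2, 3])

a = [1, 2, 3]
a.insert(0, 0)         # O(n): every existing element moves right one slot
# a is now [0, 1, 2, 3]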

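For the small-list case mentioned above, a sketch along these lines (sizes are arbitrary) rebuilds a short list inside the timed statement so each insert operates on the same handful of items; running it against builds with and without the memmove() patch would show whether the setup cost matters at that size:

import timeit

# Rebuild an 8-element list each pass so the insert always hits a small list;
# the list construction cost is identical for both statements.
print(min(timeit.repeat("a = [0]*8; a.insert(0, 0)", number=1_000_000, repeat=5)))
print(min(timeit.repeat("a = [0]*8; a[0:0] = [0]", number=1_000_000, repeat=5)))
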
----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue39801>
_______________________________________

