Debugging a memory leak

Pasha Stetsenko stpasha at gmail.com
Thu Oct 22 20:51:54 EDT 2020


Dear Python gurus,

I'm a maintainer of a python library "datatable" (can be installed from
PyPi), and i've been recently trying to debug a memory leak that occurs in
my library.
The program that exposes the leak is quite simple:
```
import datatable as dt
import gc  # just in case

def leak(n=10**7):
    for i in range(n):
        z = dt.update()

leak()
gc.collect()
input("Press enter")
```
Note that despite the name, the `dt.update` is actually a class, though it
is defined via Python C API. Thus, this script is expected to create and
then immediately destroy 10 million simple python objects.
The observed behavior, however,  is  that the script consumes more and more
memory, eventually ending up at about 500M. The amount of memory the
program ends up consuming is directly proportional to the parameter `n`.

The `gc.get_objects()` does not show any extra objects however. The
`tracemalloc` module shows that there are indeed `n` objects leaked in the
`z=dt.update()` line, but doesn't give any extra details.

In order to dig deeper, I let the process wait on the "input()" line, and
wrote a script to dump the process' memory into a file. Then I scanned
through the file looking at any repeated patterns of 64-bit words. Inside
the memory dump, the following sequences were the most common:
```
  0x0000000000000000 - 28660404
  0x00000001024be6e8 - 4999762
  0x0000000101cbdea0 - 119049
  0x0000000000000054 - 59537
  0x00000fd000000ff0 - 59526
  0x0000000000000001 - 16895
  0xffffffffffffffff - 12378
  ...
```
The most suspicious sequence here is 0x00000001024be6e8, which if you look
at that address with lldb, is the address of the PyTypeObject "dt.update",
which looks approximately like this:
```
(lldb) p *(PyTypeObject*)(0x000000010f4206e8)
(PyTypeObject) $0 = {
  ob_base = {
    ob_base = {
      ob_refcnt = 8
      ob_type = 0x000000010ec216b0
    }
    ob_size = 0
  }
  tp_name = 0x000000010f3a442c "datatable.update"
  tp_basicsize = 48
  tp_itemsize = 0
  tp_dealloc = 0x000000010f0a8040 (_datatable.cpython-36m-darwin.so`void
py::_safe_dealloc<py::oupdate::oupdate_pyobject,
&(py::oupdate::oupdate_pyobject::m__dealloc__())>(_object*) at
xobject.h:270)
  tp_print = 0x0000000000000000
  tp_getattr = 0x0000000000000000
  tp_setattr = 0x0000000000000000
  tp_as_async = 0x0000000000000000
  tp_repr = 0x000000010eab3fa0 (Python`object_repr)
  tp_as_number = 0x0000000000000000
  tp_as_sequence = 0x0000000000000000
  tp_as_mapping = 0x0000000000000000
  tp_hash = 0x000000010eb48640 (Python`_Py_HashPointer)
  tp_call = 0x0000000000000000
  tp_str = 0x000000010eab40d0 (Python`object_str)
  tp_getattro = 0x000000010eaa1ae0 (Python`PyObject_GenericGetAttr)
  tp_setattro = 0x000000010eaa1ce0 (Python`PyObject_GenericSetAttr)
  tp_as_buffer = 0x0000000000000000
  tp_flags = 266240
...
```
Thus, I can be quite certain that 0x1024be6e8 is the address of the
`dt.update` type structure.

The way this address appears in the memory dump looks like this:
```
0x7f97875cbb10: 0x00000000 0x00000000 0x024be6e8 0x00000001
0x7f97875cbb20: 0x00000000 0x00000000 0x00000000 0x00000000
0x7f97875cbb30: 0x00000000 0x00000000 0x00000000 0x00000000
0x7f97875cbb40: 0x00000000 0x00000000 0x024be6e8 0x00000001
0x7f97875cbb50: 0x00000000 0x00000000 0x00000000 0x00000000
0x7f97875cbb60: 0x00000000 0x00000000 0x00000000 0x00000000
0x7f97875cbb70: 0x00000000 0x00000000 0x024be6e8 0x00000001
0x7f97875cbb80: 0x00000000 0x00000000 0x00000000 0x00000000
0x7f97875cbb90: 0x00000000 0x00000000 0x00000000 0x00000000
0x7f97875cbba0: 0x00000000 0x00000000 0x024be6e8 0x00000001
0x7f97875cbbb0: 0x00000000 0x00000000 0x00000000 0x00000000
0x7f97875cbbc0: 0x00000000 0x00000000 0x00000000 0x00000000
```
If i guess that all these represent the leaked objects, then inspecting any
of them shows the following:
```
(lldb) p *(PyObject*)(0x7f97875cbb10)
(PyObject) $2 = {
  ob_refcnt = 0
  ob_type = 0x000000010f4206e8
}
```
Thus, the object didn't actually "leak" in the normal sense: its refcount
is 0 and it was reclaimed by the Python runtime (when i print a debug
message in tp_dealloc, i see that the destructor gets called every time).
Still, Python keeps requesting more and more memory from the system instead
of reusing the memory  that was supposed to be freed.

I'm currently not sure where to go from here. Is there something wrong with
my python object that prevents it from being correctly processed by the
Python runtime? Because this doesn't seem to be the usual case of
incrementing the refcount too many times.

This behavior was observed in Python 3.6.6 and also Python 3.8.0b2.


More information about the Python-list mailing list