[Numpy-discussion] How to debug reference counting errors

Ondřej Čertík ondrej.certik at gmail.com
Fri Aug 31 21:05:32 EDT 2012


On Fri, Aug 31, 2012 at 5:56 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> On Fri, Aug 31, 2012 at 5:35 PM, Ondřej Čertík <ondrej.certik at gmail.com>
> wrote:
>>
>> Hi Dag,
>>
>> On Fri, Aug 31, 2012 at 4:22 AM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no> wrote:
>> > On 08/31/2012 09:03 AM, Ondřej Čertík wrote:
>> >> Hi,
>> >>
>> >> There is a segfault reported here:
>> >>
>> >> http://projects.scipy.org/numpy/ticket/1588
>> >>
>> >> I've managed to isolate the problem and even provide a simple patch
>> >> that fixes it here:
>> >>
>> >> https://github.com/numpy/numpy/issues/398
>> >>
>> >> However, the patch simply skips a reference count decrement, so it
>> >> might leak. I used bisection (it took the whole evening,
>> >> unfortunately...), and the good news is that I've isolated the commits
>> >> that actually broke it. See GitHub issue #398 for details, diffs,
>> >> etc.
>> >>
>> >> Unfortunately, it's 12 commits from Mark, and the individual commits
>> >> raise an exception on the segfaulting code,
>> >> so I can't pinpoint the problem further.
>> >>
>> >> In general, how can I debug this sort of problem? I tried to use
>> >> valgrind, with a debugging build of numpy,
>> >> but it provides tons of false (?) positives:
>> >> https://gist.github.com/3549063
>> >>
>> >> Mark, by looking at the changes that broke it, as well as at my "fix",
>> >> do you see where the problem could be?
>> >>
>> >> I suspect it is something with the changes in PyArray_FromAny() or
>> >> PyArray_FromArray() in ctors.c.
>> >> But I don't see anything so far that could cause it.
>> >>
>> >> Thanks for any help. This is one of the issues blocking the 1.7.0
>> >> release.
>> >
>> > IIRC you can recompile Python with some support for detecting memory
>> > leaks. One of the issues with using Valgrind, after suppressing the
>> > false positives, is that Python uses its own memory allocator so that
>> > sits between the bug and what Valgrind detects. So at least recompile
>> > Python to not do that.
>>
>> Right. Compiling with "--without-pymalloc" (per README.valgrind, as
>> suggested above by Richard) should improve things a lot. Thanks for the tip.
>>
>> >
>> > As for hardening the NumPy source in general, you should at least be
>> > aware of these two options:
>> >
>> > 1) David Malcolm (dmalcolm at redhat.com) was writing a static code
>> > analysis plugin for gcc that would check, for every routine, that the
>> > reference count semantics are correct. (I don't know how far he's got
>> > with that.)
>> >
>> > 2) In Cython we have a "reference count nanny". This requires changes to
>> > all the code though, so not an option just for finding this bug, just
>> > thought I'd mention it. In addition to the INCREF/DECREF you need to
>> > insert new "GIVEREF" and "GOTREF" calls (which are noops in a normal
>> > compile) to declare where you get and give away a reference. When
>> > Cython-generated sources are compiled with -DCYTHON_REFNANNY,
>> > INCREF/DECREF/GIVEREF/GOTREF are tracked within each function and a
>> > failure is raised if the function violates any contract.
>>
>> I see. That's a nice option. For my own code, I never touch the
>> reference counting by hand and rather just use Cython.
>>
>>
>> In the meantime, Mark fixed it:
>>
>> https://github.com/numpy/numpy/pull/400
>> https://github.com/numpy/numpy/pull/405
>>
>> Mark, thanks again for this. That saved me a lot of time.
>
>
> No problem. The way I prefer to deal with this kind of error is to use C++
> smart pointers. C++11's unique_ptr and boost's intrusive_ptr are both useful
> for painlessly managing this kind of reference counting headache.

Oh yes. I prefer to use Trilinos' RCP, which is a shared pointer (just
like in C++11), but has better debugging info if something goes wrong.
It can be compiled in two modes: one is slower and can't segfault; the
other is optimized, with most operations running at native raw pointer
speed, but it can segfault.
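
Coming back to the original question about leaks: one cheap check that
can be run from Python, without valgrind, is to call the suspect routine
many times and watch sys.getrefcount() on its argument. The sketch below
is only an illustration and assumes CPython. The names refcount_delta,
leaky and clean are made up for this example, and leaky merely imitates
a C function that forgets a Py_DECREF by stashing its argument in a
list. It catches leaked references only, not the extra decrement that
leads to a segfault.

```python
import sys

_leaked = []  # hypothetical stand-in for references a buggy C routine never releases


def leaky(obj):
    # Imitates a routine with a missing Py_DECREF: it keeps one reference per call.
    _leaked.append(obj)


def clean(obj):
    # Takes a temporary reference and releases it before returning.
    tmp = [obj]
    del tmp


def refcount_delta(func, obj, repeats=1000):
    """Call func(obj) `repeats` times and return the change in obj's refcount."""
    before = sys.getrefcount(obj)
    for _ in range(repeats):
        func(obj)
    after = sys.getrefcount(obj)
    return after - before


sentinel = object()
print(refcount_delta(clean, sentinel))  # 0 on CPython: nothing retained
print(refcount_delta(leaky, sentinel))  # 1000 on CPython: one leaked reference per call
```

With real code, func would be the NumPy call under suspicion and obj the
array or dtype passed to it; a delta that grows with repeats points at a
missing decrement.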

Ondrej


