[Numpy-discussion] automatically avoiding temporary arrays

Chris Barker chris.barker at noaa.gov
Sat Oct 1 14:38:16 EDT 2016


Julian,

This is really, really cool!

I have been wanting something like this for years (over a decade? wow!),
but always thought it would require hacking the interpreter to intercept
operations. This is a really inspired idea, and could buy numpy a lot of
performance.

I'm afraid I can't say much about the implementation details -- but great
work!

-Chris




On Fri, Sep 30, 2016 at 2:50 PM, Julian Taylor <
jtaylor.debian at googlemail.com> wrote:

> On 30.09.2016 23:09, josef.pktd at gmail.com wrote:
> > On Fri, Sep 30, 2016 at 9:38 AM, Julian Taylor
> > <jtaylor.debian at googlemail.com> wrote:
> >> hi,
> >> Temporary arrays generated in expressions are expensive as they imply
> >> extra memory bandwidth, which is the bottleneck in most numpy operations.
> >> For example:
> >>
> >> r = a + b + c
> >>
> >> creates the b + c temporary and then adds a to it.
> >> This can be rewritten to be more efficient using inplace operations:
> >>
> >> r = b + c
> >> r += a
> >
> > General question (I wouldn't understand the details even if I looked):
> >
> > How is this affected by broadcasting and type promotion?
> >
> > One of the main reasons that I don't like to use inplace operations in
> > general is that I'm often not sure when type promotion occurs and when
> > arrays expand during broadcasting.
> >
> > For example, b + c is 1-D, a is 2-D, and r has the broadcast shape.
> > Another case where I switch away from inplace operations is when b + c
> > is int or bool and a is float. Thankfully, we get error messages for
> > casting now.
>
> The temporary is only avoided when the casting follows the safe rule, so
> it should be the same as what you get without inplace operations. E.g.
> float32-temporary + float64 will not be converted to the unsafe float32
> += float64 which a normal inplace operation would allow. But
> float64-temp + float32 is transformed.
>
> Currently the only broadcasting that will be transformed is temporary +
> scalar value; otherwise it only works on matching array shapes. There is
> not really anything that prevents full broadcasting, but it's not
> implemented yet in the PR.
>
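
For concreteness, a rough sketch of what that safe-casting rule amounts to;
np.can_cast is used here purely as an illustration, not necessarily what the
PR checks internally:

    import numpy as np

    # The elision only reuses a temporary when writing the other operand
    # into it is a safe cast, i.e. when the normal expression would produce
    # that dtype anyway.

    # float32 temporary += float64 operand: not safe, so no elision.
    print(np.can_cast(np.float64, np.float32, casting='safe'))  # False

    # float64 temporary += float32 operand: safe, so the temporary is reused.
    print(np.can_cast(np.float32, np.float64, casting='safe'))  # True
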
> >
> >>
> >> This saves some memory bandwidth and can speed up the operation by 50%
> >> for very large arrays, or even more if the inplace operation allows it
> >> to be completed entirely in the CPU cache.
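
For anyone who wants to see the effect on their own machine, a rough timing
sketch of the manual inplace rewrite versus the plain expression (the exact
size and speedup depend on the hardware):

    import timeit
    import numpy as np

    n = 5_000_000  # large enough to be memory-bandwidth bound
    a, b, c = (np.random.rand(n) for _ in range(3))

    def with_temporary():
        return a + b + c        # b + c allocates a temporary, then a is added

    def manual_inplace():
        r = b + c               # one allocation
        r += a                  # reuses r's buffer, no second temporary
        return r

    print("temporary:", timeit.timeit(with_temporary, number=20))
    print("inplace:  ", timeit.timeit(manual_inplace, number=20))
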
> >
> > I didn't realize the difference can be so large. That would make
> > streamlining some code worth the effort.
> >
> > Josef
> >
> >
> >>
> >> The problem is that inplace operations are a lot less readable, so they
> >> are often only used in well-optimized code. But due to Python's
> >> refcounting semantics we can actually do some inplace conversions
> >> transparently.
> >> If an operand in Python has a reference count of one it must be a
> >> temporary, so we can use it as the destination array. CPython itself does
> >> this optimization for string concatenation.
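
The refcount idea can be seen from pure Python; this is only an illustration,
the real check in the PR happens at the C level:

    import sys
    import numpy as np

    b = np.arange(10)
    c = np.arange(10)

    # A bound name holds a reference, so the count is at least 2 here
    # (the name 'b' plus the reference held by the call itself).
    print(sys.getrefcount(b))

    # The result of (b + c) is an unnamed temporary: the only reference is
    # the one passing it into getrefcount, so this typically prints 1.
    # A refcount of one is what signals that the buffer could be reused as
    # the destination of the next operation.
    print(sys.getrefcount(b + c))
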
> >>
> >> In numpy we have the issue that we can be called directly from the C-API,
> >> where the reference count may be one for other reasons.
> >> To solve this we can check the backtrace up to the Python frame
> >> evaluation function. If there are only numpy and Python functions in
> >> between that and our entry point we should be able to elide the
> >> temporary.
> >>
> >> This PR implements this:
> >> https://github.com/numpy/numpy/pull/7997
> >>
> >> It currently only supports Linux with glibc (which has reliable
> >> backtraces via unwinding) and maybe macOS, depending on how good their
> >> backtrace is. On Windows the backtrace APIs are different and I don't
> >> know them, but in theory it could also be done there.
> >>
> >> A problem is that checking the backtrace is quite expensive, so it should
> >> only be enabled when the involved arrays are large enough for it to be
> >> worthwhile. In my testing this seems to be around 180-300 KiB sized
> >> arrays, basically where they start spilling out of the CPU L2 cache.
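
As back-of-the-envelope arithmetic (assuming float64 operands), that cutoff
corresponds to roughly 23,000 elements per operand:

    import numpy as np

    # ~180 KiB of float64 is about 23,000 elements per operand.
    cutoff_bytes = 180 * 1024
    n = cutoff_bytes // np.dtype(np.float64).itemsize
    print(n)                  # 23040 elements
    print(np.ones(n).nbytes)  # 184320 bytes, i.e. 180 KiB
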
> >>
> >> I made a little crappy benchmark script to test this cutoff in this
> >> branch:
> >> https://github.com/juliantaylor/numpy/tree/elide-bench
> >>
> >> If you are interested you can run it with:
> >> python setup.py build_ext -j 4 --inplace
> >> ipython --profile=null check.ipy
> >>
> >> At the end it will plot the ratio between elided and non-elided runtime.
> >> It should get larger than one around 180 KiB on most CPUs.
> >>
> >> If no one points out some flaw in the approach, I'm hoping to get this
> >> into the next numpy version.
> >>
> >> cheers,
> >> Julian
> >>
> >>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov