[Numpy-discussion] Tensor Contraction (HPTT) and Tensor Transposition (TCL)

Paul Springer pavdev at gmx.de
Wed Aug 16 18:33:27 EDT 2017


On 8/16/17 at 6:08 PM, Anne Archibald wrote:
> (NB all: the thread title seems to interchange the acronyms for the
> Tensor Contraction Library (TCL) and the High-Performance Tensor
> Transpose (HPTT) packages. I'm not fixing it so as not to break
> threading.)
>
> On Wed, Aug 16, 2017 at 11:40 AM Paul Springer <pavdev at gmx.de
> <mailto:pavdev at gmx.de>> wrote:
>
>>     If you want to get it into Numpy, it would be worth checking if
>>     the existing functions can be improved before adding new ones.
>>
>>     Note that NumPy's transpose method just rearranges the strides
>>     without moving any data, so the advantage of a physical
>>     transposition is better cache performance or the ability to use
>>     CBLAS directly. I assume TCL uses some tricks to do transposition
>>     in a way that is more cache friendly?
>     HPTT is a sophisticated library for tensor transpositions; as
>     such, it blocks the tensors so that (1) spatial locality can be
>     exploited. Moreover, (2) it uses explicit vectorization to take
>     advantage of the CPU's vector units.
>
>
> I think this library provides functionality that isn't readily
> accessible from within numpy at the moment. The only functions I know
> of to rearrange the memory layout of data are things like
> ascontiguousarray and asfortranarray, as well as assignment (e.g.
> a[...] = b). The general strategy within numpy is to assume that all
> functions work equally well on arrays with arbitrary memory layouts,
> so that users often don't even know the memory layouts of their data.
> The striding functionality means data usually doesn't actually get
> transposed until absolutely necessary.
>
> Of course, few if any numpy functions work equally well on different
> memory layouts; unary ufuncs contain code to try to carry out their
> iteration in the fastest way, but it's not clear how well that works
> or whether they have the freedom to choose the layouts of their output
> arrays. 
>
> If you wanted to integrate HPTT into numpy, I think the best approach
> might be to wire it into the assignment machinery, so that when users
> do things like a[::2,:] = b[:,::3].T HPTT springs into action behind
> the scenes and makes this assignment as efficient as possible (how
> well does it handle arrays with spaces between elements?). Then
> ascontiguousarray and asfortranarray and the like could simply use
> assignment to an appropriately-ordered destination when they actually
> needed to do anything.
HPTT offers support for subtensors (via the outerSize parameter, which
is similar to the leading dimension in BLAS); thus HPTT can also handle
arbitrarily strided transpositions.
However, a non-unit stride for the fastest-varying index is devastating
for performance, since it prohibits vectorization and the exploitation
of spatial locality.
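To illustrate what that subtensor support means in practice, here is a plain-NumPy sketch (not HPTT's actual API): the transposition operates on a strided view of a larger buffer, much like a BLAS routine addresses a submatrix through its leading dimension.

```python
import numpy as np

# Plain-NumPy sketch (not HPTT's actual API) of outerSize-style subtensor
# support: physically transpose a strided view of a larger buffer.
outer = np.arange(6 * 8, dtype=np.float64).reshape(6, 8)

# Subtensor: every other row, first 5 columns -- non-contiguous, like a
# BLAS submatrix addressed through a leading dimension.
sub = outer[::2, :5]

# Force the actual data movement; HPTT would perform this blocked and
# vectorized instead of via a naive element-by-element copy.
sub_t = np.ascontiguousarray(sub.T)

assert sub_t.shape == (5, 3)
assert sub_t[4, 2] == outer[4, 4]
```

Note that `sub` already has a non-unit stride along its first axis; the performance caveat above concerns a non-unit stride on the fastest-varying (innermost) index.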

What would the integration of HPTT into NumPy look like?
Which steps would need to be taken?
Would HPTT have to be distributed in source form alongside NumPy (at
that point I might have to change the license for HPTT), or would it be
fine to add a git dependency? That way, users who build NumPy from
source could fetch HPTT and set a flag during NumPy's build process,
indicating that HPTT is available.
What would the process look like if NumPy is distributed as a
precompiled binary?

The same questions apply with respect to TCL.
>
>     TCL uses the Transpose-Transpose-GEMM-Transpose approach where all
>     tensors are flattened into matrices (via HPTT) and then contracted
>     via GEMM; the final result is eventually folded (via HPTT) into
>     the desired output tensor.
>
>
> This is a pretty direct replacement of einsum, but I think einsum may
> well already do pretty much this, apart from not using HPTT to do the
> transposes. So the way to get this functionality would be to make the
> matrix-rearrangement primitives use HPTT, as above.
That would certainly be one approach; however, TCL also explores several
different strategies/candidates and picks the one that minimizes the
data movement required by the transpositions.
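For concreteness, the TTGT (Transpose-Transpose-GEMM-Transpose) structure can be sketched in plain NumPy for one fixed contraction and one fixed flattening. The function name and index choice below are illustrative only; the actual TCL additionally searches over candidate orderings and uses HPTT for the physical transposes.

```python
import numpy as np

def ttgt_contract(A, B):
    """Sketch of Transpose-Transpose-GEMM-Transpose for the contraction
    C[m, n, o] = sum_k A[k, m, n] * B[o, k] (indices chosen for
    illustration; TCL would pick the cheapest flattening)."""
    k, m, n = A.shape
    o, k2 = B.shape
    assert k == k2
    # Transpose/flatten A into an (m*n, k) matrix ...
    A_mat = A.transpose(1, 2, 0).reshape(m * n, k)
    # ... and B into a (k, o) matrix.
    B_mat = B.transpose(1, 0).reshape(k, o)
    # A single GEMM performs the whole contraction ...
    C_mat = A_mat @ B_mat
    # ... and a final fold recovers the desired output tensor.
    return C_mat.reshape(m, n, o)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 5))
B = rng.standard_normal((6, 4))
C = ttgt_contract(A, B)
assert np.allclose(C, np.einsum('kmn,ok->mno', A, B))
```

The first transpose here is free in NumPy (a strided view), but the `reshape` forces the physical copy; that copy is exactly the step HPTT accelerates.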
>
>     Would it be possible to expose HPTT and TCL as optional packages
>     within NumPy? This way I don't have to redo the work that I've
>     already put into those libraries.
>
>
> I think numpy should be regarded as a basically-complete package for
> manipulating strided in-memory data, to which we are reluctant to add
> new user-visible functionality. Tools that can act under the hood to
> make existing code faster, or to reduce the work users must do to make
> their code run fast enough, are valuable.
It seems to me that TCL is such a candidate, since it can replace a
significant portion of the functionality offered by numpy.einsum(),
yielding significantly higher performance.

I imagine something of the form:

def einsum(...):
    if tcl_applicable and tcl_available:
        return tcl.tensorMult(...)
    # otherwise fall back to NumPy's existing einsum implementation
>  
> Would some of your techniques help numpy to more rapidly evaluate
> things like C[...] = A+B, when A B and C are arbitrarily strided and
> there are no ordering constraints on the result? Or just A+B where
> numpy is free to choose the memory layout for the result?
Actually, HPTT is only concerned with operations of the form
B[perm(i0,i1,...)] = alpha * A[i0,i1,...] + beta * B[perm(i0,i1,...)]
(where alpha and beta are scalars). Summing over multiple transposed
tensors can be quite challenging (https://arxiv.org/abs/1705.06661) and
is not covered by HPTT. Does this answer your question?
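In plain NumPy, the reference semantics of that operation could be sketched as follows. This is a naive illustration only: HPTT computes it blocked and vectorized without large temporaries, and the perm convention shown here is one plausible reading of the formula above.

```python
import numpy as np

def transpose_and_update(A, B, perm, alpha=1.0, beta=0.0):
    """Reference semantics of
        B[perm(i0, i1, ...)] = alpha * A[i0, i1, ...]
                               + beta * B[perm(i0, i1, ...)],
    i.e. HPTT's core operation, with alpha and beta scalars.
    Naive sketch: materializes the transposed A as a temporary."""
    B[...] = alpha * np.transpose(A, perm) + beta * B
    return B

A = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
B = np.ones((4, 2, 3))  # shape of A permuted by (2, 0, 1)
transpose_and_update(A, B, perm=(2, 0, 1), alpha=2.0, beta=1.0)
assert np.allclose(B, 2.0 * A.transpose(2, 0, 1) + 1.0)
```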

