[SciPy-Dev] RFC: comments to BLAS committee from numpy/scipy devs

Tyler Reddy tyler.je.reddy at gmail.com
Tue Jan 9 15:53:10 EST 2018


One common issue in computational geometry is the need to operate rapidly
on arrays with "heterogeneous shapes."

That is, an array whose rows have different numbers of columns -- shape (1, 3)
for the first polygon, shape (1, 12) for the second polygon, and so on.

This seems like a particularly nasty scenario, since the loss of
"homogeneity" in shape precludes traditional vectorization -- I think NumPy
effectively converts these to dtype=object, etc. I don't
think this is necessarily a BLAS issue, since wrapping computational-geometry
libraries does happen in a subset of cases to handle this, but if there's
overlap in utility you could pass it along, I suppose.
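
As a minimal illustration of the problem (not tied to any particular
library), this is the kind of ragged data that NumPy ends up storing as
dtype=object:

    import numpy as np

    # Two "polygons" with different numbers of coordinates.
    poly_a = [1.0, 2.0, 3.0]
    poly_b = [float(i) for i in range(12)]

    # NumPy cannot build a rectangular array from ragged rows, so the
    # result is an object array of Python lists, which defeats
    # vectorization (and any BLAS dispatch).
    polygons = np.array([poly_a, poly_b], dtype=object)
    print(polygons.dtype)   # object
    print(polygons.shape)   # (2,)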



On 9 January 2018 at 04:40, Ilhan Polat <ilhanpolat at gmail.com> wrote:

> I couldn't find an item to place this under, but I think ilaenv, and also
> calling a function twice (once with lwork=-1 to read the optimal block size,
> and then calling the function again properly with lwork=<result>), needs to
> be gotten rid of in LAPACK.
>
> That's a major annoyance during the wrapping of LAPACK routines for SciPy.
>
> I don't know if this is realistic, but the values ilaenv needs could be
> computed once at install time (or recomputed if the hardware changes) and
> then read off by the routines.
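>
> As a rough sketch of the current two-call annoyance (using
> scipy.linalg.lapack.dgeqrf purely as an example routine, and assuming its
> wrapper accepts the lwork=-1 workspace query, as most of the f2py wrappers
> do):
>
>     import numpy as np
>     from scipy.linalg import lapack
>
>     a = np.random.rand(100, 50)
>
>     # First call: workspace query.  lwork=-1 asks LAPACK to report the
>     # optimal workspace size in work[0] instead of doing any real work.
>     qr, tau, work, info = lapack.dgeqrf(a, lwork=-1)
>     lwork = int(work[0])
>
>     # Second call: the actual factorization with the optimal lwork.
>     qr, tau, work, info = lapack.dgeqrf(a, lwork=lwork)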
>
>
>
> On Jan 9, 2018 09:25, "Nathaniel Smith" <njs at pobox.com> wrote:
>
>> Hi all,
>>
>> As mentioned earlier [1][2], there's work underway to revise and
>> update the BLAS standard -- e.g. we might get support for strided
>> arrays and lose xerbla! There's a draft at [3]. They're interested in
>> feedback from users, so I've written up a first draft of comments
>> about what we would like as NumPy/SciPy developers. This is very much
>> a first attempt -- I know we have lots of people who are more expert
>> on BLAS than me on these lists :-). Please let me know what you think.
>>
>> -n
>>
>> [1] https://mail.python.org/pipermail/numpy-discussion/2017-November/077420.html
>> [2] https://mail.python.org/pipermail/scipy-dev/2017-November/022267.html
>> [3] https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit
>>
>> -----
>>
>> # Comments from NumPy / SciPy developers on "A Proposal for a
>> Next-Generation BLAS"
>>
>> These are comments on [A Proposal for a Next-Generation
>> BLAS](https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit#)
>> (version as of 2017-12-13), from the perspective of the developers of
>> the NumPy and SciPy libraries. We hope this feedback is useful, and
>> welcome further discussion.
>>
>> ## Who are we?
>>
>> NumPy and SciPy are the two foundational libraries of the Python
>> numerical ecosystem, and one of their duties is to wrap BLAS and
>> expose it for the use of other Python libraries. (NumPy primarily
>> provides a GEMM wrapper, while SciPy exposes more specialized
>> operations.) It's unclear exactly how many users we have, but we
>> certainly ship several million copies of BLAS every month, and
>> provide one of the most popular numerical toolkits for both novice and
>> expert users.
>>
>> Looking at the original BLAS and LAPACK interfaces, it often seems
>> that their imagined user is something like a classic supercomputer
>> consumer, who writes code directly in Fortran or C against the BLAS
>> API, and where the person writing the code and running the code are
>> the same. NumPy/SciPy are coming from a very different perspective:
>> our users generally know nothing about the details of the underlying
>> BLAS; they just want to describe their problem in some high-level way,
>> and the library is responsible for making it happen as efficiently as
>> possible, and is often integrated into some larger system (e.g. a
>> real-time analytics platform embedded in a web server).
>>
>> When it comes to our BLAS usage, we mostly use only a small subset of
>> the routines. However, as "consumer software" used by a wide variety
>> of users with differing degrees of technical expertise, we're expected
>> to Just Work on a wide variety of systems, and with as many different
>> vendor BLAS libraries as possible. On the other hand, the fact that
>> we're working with Python means we don't tend to worry about small
>> inefficiencies that will be lost in the noise in any case, and are
>> willing to sacrifice some performance to get more reliable operation
>> across our diverse userbase.
>>
>> ## Comments on specific aspects of the proposal
>>
>> ### Data Layout
>>
>> We are **strongly in favor** of the proposal to support arbitrary
>> strided data layouts. Ideally, this would support strides *specified
>> in bytes* (allowing for unaligned data layouts), and allow for truly
>> arbitrary strides, including *zero or negative* values. However, we
>> think it's fine if some of the weirder cases suffer a performance
>> penalty.
>>
>> Rationale: NumPy – and thus, most of the scientific Python ecosystem –
>> only has one way of representing an array: the `numpy.ndarray` type,
>> which is an arbitrary dimensional tensor with arbitrary strides. It is
>> common to encounter matrices with non-trivial strides. For example::
>>
>>     # Make a 3-dimensional tensor, 10 x 9 x 8
>>     t = np.zeros((10, 9, 8))
>>     # Considering this as a stack of eight 10x9 matrices, extract the first:
>>     mat = t[:, :, 0]
>>
>> Now `mat` has non-trivial strides on both axes. (If running this in a
>> Python interpreter, you can see this by looking at the value of
>> `mat.strides`.) Another case where interesting strides arise is when
>> performing ["broadcasting"](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html),
>> which is the name for NumPy's rules for stretching arrays to make
>> their shapes match. For example, in an expression like::
>>
>>     np.array([1, 2, 3]) + 1
>>
>> the scalar `1` is "broadcast" to create a vector `[1, 1, 1]`. This is
>> accomplished without allocating memory, by creating a vector with
>> length = 3 and stride = 0 – so all the elements share a single
>> location in memory. Similarly, by using negative strides we can
>> reverse an array without allocating memory::
>>
>>     a = np.array([1, 2, 3])
>>     a_flipped = a[::-1]
>>
>> Now `a_flipped` has the value `[3, 2, 1]`, while sharing storage with
>> the array `a = [1, 2, 3]`. Misaligned data is also possible (e.g. an
>> array of 8-byte doubles with a 9-byte stride), though it arises more
>> rarely. (An example of where it might occur is an on-disk data
>> format that alternates between storing a double value and then a
>> single byte value, which is then memory-mapped.)
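>>
>> As a small illustration of how such an array can arise (a sketch using a
>> packed structured dtype in place of an actual memory-mapped file)::
>>
>>     import numpy as np
>>
>>     # A packed record: one 8-byte double followed by one 1-byte flag.
>>     record = np.dtype([("value", "<f8"), ("flag", "u1")])
>>     records = np.zeros(4, dtype=record)
>>
>>     # Viewing just the doubles gives an array with a 9-byte stride, so
>>     # every element after the first is misaligned.
>>     doubles = records["value"]
>>     print(doubles.strides)   # (9,)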
>>
>> While this array representation is very flexible and useful, it makes
>> interfacing with BLAS a challenge: how do you perform a GEMM operation
>> on two arrays with arbitrary strides? Currently, NumPy attempts to
>> detect a number of special cases: if the strides in both arrays imply
>> a column-major layout, then call BLAS directly; if one of them has
>> strides corresponding to a row-major layout, then set the
>> corresponding `transA`/`transB` argument, etc. – and if all else
>> fails, either copy the data into a contiguous buffer, or else fall
>> back on a naive triple-nested-loop GEMM implementation. (There's also
>> a check where if we can determine through examining the arrays' data
>> pointers and strides that they're actually transposes of each other,
>> then we instead dispatch to SYRK.)
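>>
>> For illustration only, here is a very rough sketch of the kind of stride
>> checking involved (not our actual implementation), written against the
>> scipy.linalg.blas.dgemm wrapper::
>>
>>     import numpy as np
>>     from scipy.linalg.blas import dgemm
>>
>>     def gemm_with_stride_checks(a, b):
>>         # Fast path: operand already column-major -> pass as-is.
>>         # Row-major -> pass the transpose and set the trans flag.
>>         # Anything else -> copy into a Fortran-ordered buffer.
>>         if a.flags.f_contiguous:
>>             a_arg, trans_a = a, 0
>>         elif a.flags.c_contiguous:
>>             a_arg, trans_a = a.T, 1
>>         else:
>>             a_arg, trans_a = np.asfortranarray(a), 0
>>         if b.flags.f_contiguous:
>>             b_arg, trans_b = b, 0
>>         elif b.flags.c_contiguous:
>>             b_arg, trans_b = b.T, 1
>>         else:
>>             b_arg, trans_b = np.asfortranarray(b), 0
>>         return dgemm(1.0, a_arg, b_arg, trans_a=trans_a, trans_b=trans_b)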
>>
>> So for us, native stride support in the BLAS has two major advantages:
>>
>> - It allows a wider variety of operations to be transparently handled
>> using high-speed kernels.
>>
>> - It reduces the complexity of the NumPy/BLAS interface code.
>>
>> Any kind of stride support produces both of these advantages to some
>> extent. The ideal would be for the BLAS to support strides specified in
>> bytes (not elements), and to allow truly arbitrary values, because in this
>> case, we can simply pass through our arrays directly to the library
>> without having to do any fiddly checking at all. Note that for these
>> purposes, it's OK if "weird" arrays pay a speed penalty: if the BLAS
>> didn't support them, we'd just have to fall back on some even worse
>> strategy for handling them ourselves. And this approach would mean
>> that if some BLAS vendors do at some point decide to optimize these
>> cases, then our users will immediately see the advantages.
>>
>> ### Error-reporting
>>
>> We are **strongly in favor** of the draft's proposal to replace XERBLA
>> with proper error codes. XERBLA is fine for one-off programs in
>> Fortran, but awful for wrapper libraries like NumPy/SciPy.
>>
>> ### Handling of exceptional values
>>
>> A recurring issue in low-level linear-algebra libraries is that they
>> may crash, freeze, or produce unexpected results when receiving
>> non-finite inputs. Of course as a wrapper library, NumPy/SciPy can't
>> control what kind of strange data our users might pass in, and we have
>> to produce sensible results regardless, so we sometimes accumulate
>> hacks like having to validate all incoming data for exceptional values.
>>
>> We support the proposal's recommendation of "NaN interpretation (4)".
>>
>> We also note that, contrary to the note under "Interpretation (1)", in
>> the R statistical environment NaN does *not* represent missing data. R
>> maintains a clear distinction between "NA" (missing) values and "NaN"
>> (invalid) values. This is important for various statistical procedures
>> such as imputation, where you might want to estimate the true value of
>> an unmeasured variable, but you don't want to estimate the true value
>> of 0/0. For efficiency, they do perform some calculations on NA values
>> by storing them as NaNs with a specific payload; in the R context,
>> therefore, it's particularly valuable if operations **correctly
>> propagate NaN payloads**. NumPy does not yet provide first-class
>> missing value support, but we plan to add it within the next 1-2
>> years, and when we do it seems likely that we'll have similar needs to
>> R.
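>>
>> (As a toy illustration of what "payload propagation" means -- the bit
>> pattern below is arbitrary, not R's actual NA encoding)::
>>
>>     import struct
>>
>>     # Build a quiet NaN carrying an arbitrary payload in its low bits.
>>     payload = 1234
>>     bits = 0x7FF8000000000000 | payload
>>     tagged_nan = struct.unpack("<d", struct.pack("<Q", bits))[0]
>>
>>     # Whether an arithmetic operation preserves the payload depends on
>>     # the hardware and the library; here we just read it back out.
>>     result_bits = struct.unpack("<Q", struct.pack("<d", tagged_nan + 1.0))[0]
>>     print(result_bits & 0x000FFFFFFFFFFFFF)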
>>
>> ### BLAS G2
>>
>> NumPy has a rich type system, including 16-, 32-, and 64-bit floats
>> (plus extended precision on some platforms), a similar set of complex
>> values, and is further extensible with new types. We have a similarly
>> rich dispatch system for operations that attempts to find the best
>> matching optimized-loop for an arbitrary set of input types. We're
>> thus quite interested in the proposed mixed-type operations, and if
>> implemented we'll probably start transparently taking advantage of
>> them (i.e., in cases where we're currently upcasting to perform a
>> requested operation, we may switch to using the native precision
>> operation without requiring any changes in user code).
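>>
>> For example, today a mixed-precision matrix product is handled by
>> upcasting (a small sketch)::
>>
>>     import numpy as np
>>
>>     a32 = np.ones((100, 100), dtype=np.float32)
>>     b64 = np.ones((100, 100), dtype=np.float64)
>>
>>     # Currently the float32 operand is upcast (copied) to float64 and the
>>     # product is computed in double precision; a mixed-type BLAS routine
>>     # would let us skip the copy without changing this user-visible result.
>>     c = a32 @ b64
>>     print(c.dtype)   # float64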
>>
>> Two comments, though:
>>
>> 1. The various tricks to make names less redundant (e.g., specifying
>> just one type if they're all the same) are convenient for those
>> calling these routines by hand, but rather inconvenient for those of
>> us who will be generating a table mapping different type combinations
>> to different functions. When designing a compression scheme for names,
>> please remember that the scheme will have to be unambiguously
>> implemented in code by multiple projects. Another option to consider:
>> ask vendors to implement the full "uncompressed" names, and then
>> provide a shim library mapping short "convenience" names to their full
>> form.
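>>
>> (To make "generating a table" concrete, a sketch with hypothetical,
>> made-up routine names -- not the proposal's actual naming scheme)::
>>
>>     # Hypothetical mapping from (input, input, output) dtypes to the
>>     # fully spelled-out routine name a wrapper library would look up.
>>     GEMM_TABLE = {
>>         ("float32", "float32", "float32"): "BLAS_gemm_s_s_s",
>>         ("float32", "float64", "float64"): "BLAS_gemm_s_d_d",
>>         ("float64", "float64", "float64"): "BLAS_gemm_d_d_d",
>>     }
>>
>>     def lookup_gemm(a_dtype, b_dtype, out_dtype):
>>         # A compressed naming scheme forces every wrapper to reimplement
>>         # the decompression logic that reconstructs entries like these.
>>         return GEMM_TABLE[(str(a_dtype), str(b_dtype), str(out_dtype))]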
>>
>> 2. Regarding the suggestion that the naming scheme will allow for a
>> wide variety of differently typed routines, but that only some subset
>> will be implemented by any particular vendor: we are worried about
>> this subset business. For a library like NumPy that wants to wrap
>> "all" the provided routines, while supporting "all" the different BLAS
>> libraries ... how will this work? **How do we know which routines any
>> particular library provides?** What if new routines are added to the
>> list later?
>>
>> ### Reproducible BLAS
>>
>> This is interesting, but we don't have any particularly useful comments.
>>
>> ### Batch BLAS
>>
>> NumPy/SciPy actually provide batched operations in many cases,
>> currently implemented using a `for` loop around individual calls to
>> BLAS/LAPACK. However, our interface is relatively restricted compared
>> to some of the proposals considered here: generally all we need is the
>> ability to perform an operation on two "stacks" of size-homogeneous
>> matrices, with the offset between matrices determined by an arbitrary
>> (potentially zero) stride.
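>>
>> A sketch of the current approach, for stacks of matrices whose batch
>> offset is an arbitrary (here, zero) stride::
>>
>>     import numpy as np
>>     from scipy.linalg.blas import dgemm
>>
>>     # One real stack of 1000 small matrices, and one "virtual" stack
>>     # created by broadcasting a single matrix with a zero batch stride.
>>     a_stack = np.random.rand(1000, 4, 4)
>>     b_stack = np.broadcast_to(np.random.rand(4, 4), (1000, 4, 4))
>>
>>     # Today: a plain Python loop around individual BLAS calls.
>>     c_stack = np.empty((1000, 4, 4))
>>     for i in range(1000):
>>         c_stack[i] = dgemm(1.0, a_stack[i], b_stack[i])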
>>
>> ### Fixed-point BLAS
>>
>> This is interesting, but we don't have any particularly useful comments.
>>
>>
>> --
>> Nathaniel J. Smith -- https://vorpus.org
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at python.org
>> https://mail.python.org/mailman/listinfo/scipy-dev
>>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org
> https://mail.python.org/mailman/listinfo/scipy-dev
>
>