[Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

Matthew Brett matthew.brett at gmail.com
Tue Apr 29 00:09:47 EDT 2014


Hi,

On Mon, Apr 28, 2014 at 5:50 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Tue, Apr 29, 2014 at 1:05 AM, Matthew Brett <matthew.brett at gmail.com> wrote:
>> Hi,
>>
>> On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>> On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <michael.lehn at uni-ulm.de> wrote:
>>>>
>>>> On 11 Apr 2014, at 19:05, Sturla Molden <sturla.molden at gmail.com> wrote:
>>>>
>>>>> Sturla Molden <sturla.molden at gmail.com> wrote:
>>>>>
>>>>>> Making a totally new BLAS might seem like a crazy idea, but it might be the
>>>>>> best solution in the long run.
>>>>>
>>>>> To see if this can be done, I'll try to re-implement cblas_dgemm and then
>>>>> benchmark against MKL, Accelerate and OpenBLAS. If I can get the
>>>>> performance better than 75% of their speed, without any assembly or dark
>>>>
>>>> So what percentage of their performance have you achieved so far?
>>>
>>> I finally read this paper:
>>>
>>>    http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
>>>
>>> and I have to say that I'm no longer so convinced that OpenBLAS is the
>>> right starting point. They make a compelling argument that BLIS *is*
>>> the cleaned-up, maintainable, and yet still competitive
>>> reimplementation of GotoBLAS/OpenBLAS that we all want, and that
>>> getting there required a qualitative reorganization of the code (i.e.,
>>> very hard to do incrementally). But they've done it. And, I get the
>>> impression that what they're missing -- threading, cross-platform
>>> build support, and runtime CPU adaptation -- is all pretty
>>> straightforward, and is only missing because no-one's gotten
>>> around to sitting down and implementing it. (In particular, that
>>> paper does include impressive threading results; it sounds like,
>>> given a decent thread pool library, one could get competitive
>>> performance pretty trivially; it's just that they haven't bothered
>>> yet to do thread pools properly or systematically test which of
>>> the pretty-good approaches to threading is "best" -- which is
>>> important if your goal is to write papers about BLAS libraries,
>>> but irrelevant to reaching the minimum-viable-product stage.)
>>>
>>> It would be really interesting if someone were to try hacking
>>> simple runtime CPU detection into BLIS and see how far they could
>>> get -- right now they do kernel selection via the C preprocessor,
>>> but hacking in some function-pointer dispatch instead would not be
>>> that hard, I think. A
>>> maintainable library that builds on Linux/OSX/Windows, gets
>>> competitive performance on last-but-one generation x86-64 CPUs, and
>>> gets better-than-reference-BLAS performance everywhere else, would be
>>> a very very compelling product that I bet would quickly attract the
>>> necessary attention to make it competitive on all CPUs.
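
Something like this, perhaps - a minimal sketch of the function-pointer
idea, using GCC's __builtin_cpu_supports for the detection. The kernel
names here are made up (this is not BLIS's actual micro-kernel
interface), and the AVX variant is stubbed out so the sketch compiles
and links on its own:

    #include <stddef.h>

    /* Micro-kernel type: rank-k update of one small (here 4x4) block
       of C -- the building block a BLIS-style gemm is made of. */
    typedef void (*dgemm_ukr_t)(size_t k, const double *a,
                                const double *b, double *c, size_t ldc);

    /* Portable fallback: plain C, correct on any CPU. */
    static void ukr_generic(size_t k, const double *a, const double *b,
                            double *c, size_t ldc)
    {
        for (size_t p = 0; p < k; p++)
            for (size_t i = 0; i < 4; i++)
                for (size_t j = 0; j < 4; j++)
                    c[i * ldc + j] += a[p * 4 + i] * b[p * 4 + j];
    }

    /* A real AVX kernel would live in its own translation unit,
       compiled with -mavx; here it just forwards to the generic one
       so the sketch links. */
    static void ukr_avx(size_t k, const double *a, const double *b,
                        double *c, size_t ldc)
    {
        ukr_generic(k, a, b, c, ldc);
    }

    /* Call sites go through this pointer instead of a symbol chosen
       by the C preprocessor at build time. */
    static dgemm_ukr_t dgemm_ukr = ukr_generic;

    /* Run once at library initialisation. */
    static void select_kernels(void)
    {
    #if defined(__GNUC__) && defined(__x86_64__)
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx"))
            dgemm_ukr = ukr_avx;
    #endif
    }

On other compilers or non-x86 CPUs the pointer just stays on the
generic kernel; a real implementation would query cpuid directly and
cover more instruction sets, but the dispatch mechanism itself is no
more complicated than this.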
>>
>> I wonder - is there anyone who might be able to do this work, if
>> we found funding for a couple of months?
>
> Not much point in worrying about this, I think, until someone tries a
> proof of concept. But potentially even the labs working on BLIS would
> be interested in a small grant from NumFOCUS or something.

The problem is that the time and mental energy involved in the
proof of concept may be enough to prevent it from being done;
having some money to pay for time and to placate employers may
help to overcome that.

To be clear, I don't mean me: I will certainly help if I can, but
being paid isn't going to help me work on this.
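
On the threading point above, for concreteness: a toy sketch of how
the work splits, with a made-up entry point (dgemm_threaded), a naive
triple loop standing in for the optimised serial kernel, and a thread
spawned per column block instead of a real pool:

    #include <pthread.h>
    #include <stddef.h>

    #define MAX_THREADS 64

    /* One worker's share: the column block C[:, j0 .. j0+n-1].
       Workers write disjoint columns, so no locking is needed. */
    typedef struct {
        size_t m, n, k, j0, ld;   /* ld = total number of columns */
        const double *a;          /* m x k, row-major */
        const double *b;          /* k x ld, row-major */
        double *c;                /* m x ld, row-major */
    } slice_t;

    /* Naive serial kernel on one slice; a real library would call
       its optimised single-threaded path here instead. */
    static void *worker(void *p)
    {
        slice_t *s = p;
        for (size_t i = 0; i < s->m; i++)
            for (size_t j = s->j0; j < s->j0 + s->n; j++) {
                double acc = 0.0;
                for (size_t t = 0; t < s->k; t++)
                    acc += s->a[i * s->k + t] * s->b[t * s->ld + j];
                s->c[i * s->ld + j] = acc;
            }
        return NULL;
    }

    /* C = A*B with the columns of C split evenly across threads. */
    void dgemm_threaded(size_t m, size_t n, size_t k, const double *a,
                        const double *b, double *c, int nthreads)
    {
        pthread_t tid[MAX_THREADS];
        slice_t arg[MAX_THREADS];
        if (nthreads < 1)
            nthreads = 1;
        if (nthreads > MAX_THREADS)
            nthreads = MAX_THREADS;
        size_t chunk = (n + (size_t)nthreads - 1) / (size_t)nthreads;
        int used = 0;
        for (int t = 0; t < nthreads && (size_t)t * chunk < n; t++) {
            size_t j0 = (size_t)t * chunk;
            size_t nn = (j0 + chunk > n) ? n - j0 : chunk;
            arg[t] = (slice_t){ m, nn, k, j0, n, a, b, c };
            pthread_create(&tid[t], NULL, worker, &arg[t]);
            used++;
        }
        for (int t = 0; t < used; t++)
            pthread_join(tid[t], NULL);
    }

A real library would keep the workers alive in a pool and hand them
slices instead of spawning threads per call, but the partitioning is
the same - which is why I can believe the threading layer itself is
not the hard part.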

Cheers,

Matthew


