[Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

Mon Apr 28 19:30:25 EDT 2014

On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <michael.lehn at uni-ulm.de> wrote:
>
> On 11 Apr 2014, at 19:05, Sturla Molden <sturla.molden at gmail.com> wrote:
>
>> Sturla Molden <sturla.molden at gmail.com> wrote:
>>
>>> Making a totally new BLAS might seem like a crazy idea, but it might be the
>>> best solution in the long run.
>>
>> To see if this can be done, I'll try to re-implement cblas_dgemm and then
>> benchmark against MKL, Accelerate and OpenBLAS. If I can get the
>> performance better than 75% of their speed, without any assembly or dark
>
> So what percentage of their performance have you achieved so far?

I finally read this paper:

   http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf

and I have to say that I'm no longer so convinced that OpenBLAS is the
right starting point. They make a compelling argument that BLIS *is*
the cleaned up, maintainable, and yet still competitive
reimplementation of GotoBLAS/OpenBLAS that we all want, and that
getting there required a qualitative reorganization of the code (i.e.,
very hard to do incrementally). But they've done it. And I get the
impression that the stuff they're missing -- threading, cross-platform
build stuff, and runtime CPU adaptation -- is all pretty
straightforward, and is only missing because no one has gotten
around to sitting down and implementing it. (In particular, that paper
does include impressive threading results; it sounds like, given a
decent thread pool library, one could get competitive performance
pretty trivially. It's just that they haven't yet bothered to do
thread pools properly, or to systematically test which of the
pretty-good approaches to threading is "best". That kind of comparison
matters if your goal is to write papers about BLAS libraries, but it is
irrelevant to reaching the minimal-viable-product stage.)

It would be really interesting if someone were to try hacking simple
runtime CPU detection into BLIS and see how far they could get. Right
now they do kernel selection via the C preprocessor (i.e., at build
time), but hacking in some function-pointer dispatch instead would not
be that hard, I think. A maintainable library that builds on
Linux/OSX/Windows, gets competitive performance on
last-but-one-generation x86-64 CPUs, and gets
better-than-reference-BLAS performance everywhere else would be a very
compelling product that I bet would quickly attract the necessary
attention to make it competitive on all CPUs.
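
To make the function-pointer idea a bit more concrete, here is a rough
sketch of what I have in mind. None of these names come from BLIS:
dgemm_ukernel_fn, select_dgemm_ukernel, cpu_has_avx2 and the two
kernels are all made up for illustration, and the "AVX2" kernel is just
a stand-in that forwards to the portable loop. The CPU check assumes
GCC or Clang on x86, where __builtin_cpu_supports is available; on
anything else it falls back to the generic kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical micro-kernel signature (not the BLIS API):
 * C[m x n] += A[m x k] * B[k x n], all row-major and contiguous. */
typedef void (*dgemm_ukernel_fn)(size_t m, size_t n, size_t k,
                                 const double *a, const double *b,
                                 double *c);

/* Portable fallback kernel: plain triple loop, no intrinsics. */
static void dgemm_ukernel_generic(size_t m, size_t n, size_t k,
                                  const double *a, const double *b,
                                  double *c)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t p = 0; p < k; p++) {
            double aip = a[i * k + p];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aip * b[p * n + j];
        }
    }
}

/* Placeholder for a tuned AVX2 kernel. A real one would use
 * intrinsics or assembly; this one just reuses the generic loop. */
static void dgemm_ukernel_avx2(size_t m, size_t n, size_t k,
                               const double *a, const double *b,
                               double *c)
{
    dgemm_ukernel_generic(m, n, k, a, b, c);
}

/* Runtime feature check. Assumes GCC/Clang builtins on x86;
 * reports "no AVX2" on every other compiler or architecture. */
static int cpu_has_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

/* Cached choice; NULL until the first call. */
static dgemm_ukernel_fn dgemm_ukernel = NULL;

/* Pick a kernel once at run time instead of via #ifdef at build
 * time. Not thread-safe as written: call it once during library
 * initialisation, before any threads are started. */
dgemm_ukernel_fn select_dgemm_ukernel(void)
{
    if (dgemm_ukernel == NULL)
        dgemm_ukernel = cpu_has_avx2() ? dgemm_ukernel_avx2
                                       : dgemm_ukernel_generic;
    assert(dgemm_ukernel != NULL);
    return dgemm_ukernel;
}
```

The caller would fetch the pointer once with select_dgemm_ukernel() and
then call through it in the inner blocking loops, so the per-call cost
is one indirect call rather than a feature check. Whether that maps
cleanly onto how BLIS structures its kernels is something I haven't
checked.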

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org