[SciPy-Dev] Boost for stats

Nicholas McKibben nicholas.bgp at gmail.com
Wed Feb 17 21:44:53 EST 2021


Hi all,

Responding to some comments I've seen fly by on this thread in no
particular order:

> However, playing devil's advocate somewhat:
> - does the scipy PR need the whole Boost.Math? If it only needs a select
subset (e.g., do we need root-finding etc?), then maybe the size can be
reduced.

As Hans mentioned, Boost.Math depends on the whole of Boost, so not
without the pain of untangling code and losing the ability to easily
bring in upstream updates.

> - do we need the whole thing? e.g. ufunc loops only need a select subset
of types.

Virtually all Boost functions (and certainly the ones we're dealing with in
the stats distributions) are templated. The ufunc generators I've written
instantiate the templates for every type the ufuncs need (single, double,
and long double precision, specifically). float16 could be done in
principle by unpacking to floats in the ufunc loop function, but no other
distribution considers float16, so I didn't either.
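
To make the float16 remark a bit more concrete, here is a minimal Python
sketch of the unpack / compute / repack idea. It is not code from the PR:
`logistic_cdf` and `half_loop` are made-up names, the logistic CDF is only a
stand-in for a templated distribution function, and a real ufunc inner loop
would be written in C/C++ against NumPy's loop signature rather than with the
`struct` module.

```python
import math
import struct


def logistic_cdf(x: float) -> float:
    """Stand-in for a distribution function evaluated in double precision."""
    return 1.0 / (1.0 + math.exp(-x))


def half_loop(buf: bytes) -> bytes:
    """Apply logistic_cdf to a buffer of IEEE 754 half-precision values.

    Each 2-byte half is widened to a Python float (a C double), the function
    is evaluated at that precision, and the result is rounded back to half
    precision when it is packed again.
    """
    n = len(buf) // 2
    values = struct.unpack(f"<{n}e", buf)        # widen float16 -> float
    results = [logistic_cdf(v) for v in values]  # compute in wide precision
    return struct.pack(f"<{n}e", *results)       # narrow back to float16


# Hypothetical usage: three half-precision inputs in, three out.
inputs = struct.pack("<3e", -1.0, 0.0, 1.0)
outputs = struct.unpack("<3e", half_loop(inputs))
```

The only point of the sketch is that the float16 case needs no extra
template instantiation: the loop borrows the wider one and pays for two
conversions per element.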

> - if we do go this route of taking parts / applying scipy specific
patches, what is easier to do or better maintenance-wise: vendor original
code + patches, or do the work once by porting relevant parts to standalone
C or C++ subset?

It sounds like the preferred option (taking the discussion here and in the
PR) is to include Boost as a submodule (which precludes SciPy-specific
patches, incidentally) and track specific tagged commits or commits with
bug fixes as necessary. The problem with porting to C is that we lose the
type extensibility and the ability to easily pull in upstream bug fixes.
Existing C ports of Boost functions could (should?) be moved to use
Boost proper to reduce the maintenance burden.

> pybind11

The only reason I didn't consider pybind11 is that I've never used it
before and could get it done with Cython. The only troublesome C++
features I ran into were non-type template parameters, but there are easy
workarounds for these. If anyone would like to patch my PR to use pybind11,
please do!

> Probably good to make sure that aarch64 build times remain relatively
stable

Good point!  Can this be checked via a PR to scipy-wheels?

Best,
Nicholas

On Wed, Feb 17, 2021 at 7:09 PM Sam Wallan <samwallan at icloud.com> wrote:

> Hello,
>
> I’ve been working on a spreadsheet that compares Boost and SciPy. I looked
> at statistical distributions, special functions, and ODE solvers. Here’s
> the Google Sheets link:
>
>
> https://docs.google.com/spreadsheets/d/1zVaau6k1_0yQNW107D81RVCirWEN8sXwcYaWj2g8UNY/edit?usp=sharing
>
> I’ve left it on suggestion mode with that sharing link, so if anyone has
> any thoughts please feel free to leave a comment. It looks like Boost may
> have a lot to add!
>
> Regards,
>
> Sam
>
>
>
> > On Feb 15, 2021, at 4:48 AM, scipy-dev-request at python.org wrote:
> >
> > Today's Topics:
> >
> >   1. Re: Boost for stats (Neal Becker)
> >   2. Re: Boost for stats (Hans Dembinski)
> >   3. Re: Boost for stats (Ralf Gommers)
> >   4. Re: Boost for stats (Neal Becker)
> >
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Mon, 15 Feb 2021 07:23:38 -0500
> > From: Neal Becker <ndbecker2 at gmail.com>
> > To: SciPy Developers List <scipy-dev at python.org>
> > Subject: Re: [SciPy-Dev] Boost for stats
> > Message-ID:
> >       <CAG3t+pHTnLa+EHL5G=_
> Esvi1unvYO0+DNnv8RxGKryuTS+jBUg at mail.gmail.com>
> > Content-Type: text/plain; charset="UTF-8"
> >
> > I have been using pybind11 (and its predecessor before it,
> > boost::python) to package C++ code for Python use for many years,
> > including some of the Boost libraries.
> > pybind11 is easy to use and is much better than e.g. Cython for
> > packaging C++ code.  pybind11 is also header-only.
> >
> > I would also like to call attention for anyone interested in
> > scientific software and c++ to a wonderful library (header-only),
> > xtensor
> > https://xtensor.readthedocs.io/en/latest/
> >
> > On Mon, Feb 15, 2021 at 7:15 AM Hans Dembinski <hans.dembinski at gmail.com>
> wrote:
> >>
> >>
> >>> On 15. Feb 2021, at 08:26, Andrew Nelson <andyfaff at gmail.com> wrote:
> >>>
> >>> My questions would be:
> >>>
> >>> - how portable is the boost code in general?
> >>
> >> It is very portable. The core goal of Boost is to offer implementations
> >> with quality and portability on par with the C++ standard library
> >> implementations. Non-portable extensions are sometimes used to speed up
> >> things, but there is always a standard compliant vanilla version. In
> >> practice, maintainers test portability with CI on Windows, OSX, Linux,
> >> using various versions of gcc, clang, msvc, intel, see e.g.
> >>
> >> https://github.com/boostorg/math/blob/develop/.github/workflows/ci.yml
> >>
> >> and the Boost build farm from the days before free CI for OSS was
> >> easily available,
> >>
> >> https://www.boost.org/development/tests/master/developer/move.html
> >>
> >> Not all compilers/platforms are fully compliant, of course. Boost uses
> >> workarounds to combat that and submits bug reports on the compiler bug
> >> trackers.
> >>
> >>> - how easy is it to install the library.
> >>
> >> As Nicholas mentioned, Boost.Math (and Boost.Histogram) is header-only,
> >> so it is sufficient to include the headers.
> >>
> >> Best regards,
> >> Hans
> >> _______________________________________________
> >> SciPy-Dev mailing list
> >> SciPy-Dev at python.org
> >> https://mail.python.org/mailman/listinfo/scipy-dev
> >
> >
> >
> > --
> > Those who don't understand recursion are doomed to repeat it
> >
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Mon, 15 Feb 2021 13:35:25 +0100
> > From: Hans Dembinski <hans.dembinski at gmail.com>
> > To: SciPy Developers List <scipy-dev at python.org>
> > Subject: Re: [SciPy-Dev] Boost for stats
> > Message-ID: <13411649-82A4-4EC1-A58C-FAA3DDFF11D1 at gmail.com>
> > Content-Type: text/plain;     charset=us-ascii
> >
> >
> >> On 15. Feb 2021, at 01:47, Warren Weckesser <warren.weckesser at gmail.com>
> >> wrote:
> >>
> >> * The Boost histogram library might provide some benefits over the
> >>  existing NumPy and SciPy options.  (Hans Dembinski, the author
> >>  of the histogram library, has already commented in this email
> >>  thread.)
> >
> > I would happily support this. We currently offer a Python front-end to
> > Boost.Histogram
> > https://github.com/scikit-hep/boost-histogram
> > which includes a numpy.histogram-compatible interface.
> >
> > Switching to Boost.Histogram may offer performance benefits, see
> > https://boost-histogram.readthedocs.io/en/latest/notebooks/PerformanceComparison.html
> >
> > Compared to np.histogram we saw a 1.7x speedup single-threaded, and more
> > if multiple threads are used. Compared to np.histogram2d we saw an 11x
> > speedup. These numbers should probably be checked more carefully before
> > decisions are made.
> >
> > Boost.Histogram offers generalised histograms with arbitrary accumulators
> > per cell, so it could also replace the implementations of
> > https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic.html
> > and friends.
> >
> > Best regards,
> > Hans
> >
> > ------------------------------
> >
> > Message: 3
> > Date: Mon, 15 Feb 2021 13:41:51 +0100
> > From: Ralf Gommers <ralf.gommers at gmail.com>
> > To: SciPy Developers List <scipy-dev at python.org>
> > Subject: Re: [SciPy-Dev] Boost for stats
> > Message-ID:
> >       <
> CABL7CQjYZh0CyA6Kx5FULw2KaYMmdrLbm0Jecztc5+4z+r8OJg at mail.gmail.com>
> > Content-Type: text/plain; charset="utf-8"
> >
> > That would be really nice. binned_statistic is currently pure Python, and
> > can be a performance hotspot (I've seen multiple cases of that in dealing
> > with image and geospatial data).
> >
> > Cheers,
> > Ralf
> >
> > ------------------------------
> >
> > Message: 4
> > Date: Mon, 15 Feb 2021 07:47:48 -0500
> > From: Neal Becker <ndbecker2 at gmail.com>
> > To: SciPy Developers List <scipy-dev at python.org>
> > Subject: Re: [SciPy-Dev] Boost for stats
> > Message-ID:
> >       <CAG3t+pH-Eq2KfEwRbN0UVRSNmdSkY-DmhD=
> C-Tvo6bnPKWNH_w at mail.gmail.com>
> > Content-Type: text/plain; charset="UTF-8"
> >
> > One thing I've missed with the current scipy histogram is the ability
> > to do 'online' or 'incremental' collection of the histogram data.  For
> > this reason I have written my own histogram code.  I am often
> > collecting data from Monte Carlo simulations and want to accumulate
> > stats from data that arrives in batches.
> > I don't know if boost-histogram supports this but if so I would find
> > this very welcome.
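
For anyone skimming, "online" collection here just means keeping the bin
counts around and adding each new batch to them. A minimal pure-Python
sketch of that idea follows; the `RunningHistogram` class is hypothetical
and is not the boost-histogram API, and dropping out-of-range samples is
simply the choice made for this sketch.

```python
from bisect import bisect_right


class RunningHistogram:
    """Fixed-edge 1-D histogram that accumulates counts across batches."""

    def __init__(self, edges):
        self.edges = list(edges)
        self.counts = [0] * (len(self.edges) - 1)

    def fill(self, batch):
        """Add one batch of samples to the running counts."""
        last = len(self.counts) - 1
        for x in batch:
            if x < self.edges[0] or x > self.edges[-1]:
                continue  # out-of-range samples are dropped in this sketch
            # bisect_right returns bin index + 1; the right-most edge is
            # folded into the last bin, as np.histogram does.
            i = min(bisect_right(self.edges, x) - 1, last)
            self.counts[i] += 1
        return self


hist = RunningHistogram([0.0, 1.0, 2.0, 3.0])
hist.fill([0.1, 0.5, 1.5])       # first batch
hist.fill([2.0, 2.9, 3.0, 7.0])  # second batch; 7.0 is out of range
```

Because only the counts are stored, memory use is independent of how many
batches arrive, which is the property that matters for long simulations.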
> >
> >
> > ------------------------------
> >
> > End of SciPy-Dev Digest, Vol 208, Issue 13
> > ******************************************
>