[SciPy-Dev] Boost for stats

Gregory Lee grlee77 at gmail.com
Mon Feb 15 17:24:24 EST 2021


On Sun, Feb 14, 2021 at 7:47 PM Warren Weckesser <warren.weckesser at gmail.com>
wrote:

>
>
> On Fri, Feb 12, 2021 at 4:50 PM Nicholas McKibben <nicholas.bgp at gmail.com>
> wrote:
>
>> Hi all,
>>
>> Many stats distributions in SciPy have outstanding issues with difficult
>> solutions in legacy code.  We've been working on replacing existing
>> statistical distributions with those found in Boost.Math.  The initial
>> implementation resolves almost a dozen issues for scipy.stats with
>> potential for resolving several more in scipy.stats and scipy.special in
>> future PRs.
>>
>> Initial PR: https://github.com/scipy/scipy/pull/13328
>> This PR includes the ability to easily add Boost functionality through
>> generated ufuncs.
>>
>> Boost is a large library and would incur the cost of one of the following:
>> - an additional dependency (e.g. boostinator
>> https://github.com/mckib2/boostinator) that outsources the packaging of
>> the Boost libraries
>> - the inclusion of Boost within SciPy either as a "clone and own" or
>> submodule
>>
>> The initial PR includes the zipped Boost headers only (~24MB zipped), but
>> adding Boost as a submodule might be a more maintainable approach if
>> changes to Boost need to be made in the future.
>>
>> Inclusion of the entire Boost library is a virtual necessity for the
>> Boost.Math module.  Neither manually stripping away unnecessary files
>> nor bcp (Boost's utility for producing stripped-down installations)
>> yields a smaller size.  The increase in size would be similar to the
>> following:
>>
>> SciPy master repo ~177 MB
>> Boost branch: ~221 MB
>>
>> Built: ~939 MB
>> Built With Boost: ~1090 MB
>>
>> Wheel size should not be significantly impacted because Boost is used as
>> a header-only library.
>>
>> I have no relationship with the Boost libraries other than as a user and
>> bug reporter.  I find them to be impressive and well-maintained with
>> tremendous support from both industry and open source developers.  SciPy
>> would benefit from the efficient, well-tested and maintained
>> implementations of stats and special algorithms.
>>
>> Thanks,
>> Nicholas
>>
>
>
> Thanks Nick, this is a long overdue enhancement for SciPy.
>
> For years, we've been fixing bugs in `stats` and `special` for
> functions that have high quality, thoroughly tested and license-
> compatible implementations in Boost.  Indeed, we often check our
> results against Boost.  I think it will be worth the effort of
> working out the interface issues and the packaging issues to allow
> SciPy to take advantage of the excellent code in Boost.
>
> The pull request focuses on using Boost functions to improve the
> implementations of some of the probability distributions in `stats`,
> but in the long term, we should take advantage of Boost as much as
> we can.  The next obvious module is `special`, where it looks like we
> can close a bunch of issues by replacing the existing implementation
> of a function with the Boost version (e.g. `erfinv` [gh-12758],
> `betaincinv` [gh-12796], `jn_zeros` [gh-4690]).
>
> There are many other parts of Boost that would be great to have
> available in SciPy.  Here are a few (based on a browse through the
> Boost docs; I don't know if wrapping any of these would have
> insurmountable technical obstacles):
>
> * ODE solvers; in particular, Boost has symplectic solvers, which
>   are not currently available in SciPy (gh-12690).
> * Interpolators: it looks like Boost has a few interpolators
>   that are not currently in SciPy.
> * The Boost histogram library might provide some benefits over the
>   existing NumPy and SciPy options.  (Hans Dembinski, the author
>   of the histogram library, has already commented in this email
>   thread.)
>
> Warren
>
>

I think the Boost Graph Library is also of substantial interest and could
be used in SciPy. We would likely use max-flow algorithms downstream in
scikit-image for things like image segmentation and phase unwrapping if
they were made available. Discussion in
https://github.com/scikit-image/scikit-image/issues/4832 seems to indicate
that the scipy.sparse.csgraph.maximum_flow algorithm is fairly slow, and a
faster replacement would be welcome.
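
To make the use case concrete, here is a minimal pure-Python sketch of the
max-flow computation itself, using the textbook Edmonds-Karp method on a
small dense capacity matrix. It is only an illustration of the problem
being solved: it is not the csgraph implementation, nor what the Boost
Graph Library does internally, and the function name `max_flow` and the
example graph are made up for this sketch.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a dense capacity matrix (list of lists).

    Illustrative only; `max_flow` is a hypothetical helper, not a SciPy
    or Boost function.
    """
    if source == sink:
        raise ValueError("source and sink must differ")
    n = len(capacity)
    # Residual capacities start out as a copy of the input capacities.
    residual = [row[:] for row in capacity]
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            return total  # no augmenting path is left
        # Walk back from the sink to find the bottleneck capacity.
        bottleneck = float("inf")
        v = sink
        while v != source:
            u = parent[v]
            bottleneck = min(bottleneck, residual[u][v])
            v = u
        # Push the bottleneck along the path, updating reverse edges.
        v = sink
        while v != source:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Edges: 0->1 (3), 0->2 (2), 1->2 (1), 1->3 (2), 2->3 (3)
capacity = [
    [0, 3, 2, 0],
    [0, 0, 1, 2],
    [0, 0, 0, 3],
    [0, 0, 0, 0],
]
print(max_flow(capacity, 0, 3))  # prints 5
```

In SciPy the equivalent call is scipy.sparse.csgraph.maximum_flow, which
takes a CSR matrix of integer capacities plus the source and sink indices
and returns a result whose flow_value attribute holds the same quantity.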

- Greg



>
> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at python.org
>> https://mail.python.org/mailman/listinfo/scipy-dev
>>