[scikit-learn] scikit-learn Digest, Vol 30, Issue 25

Mon Oct 1 21:11:52 EDT 2018

The current roadmap is amazing. One feature that would be exciting is
better support for multilayer stacking with caching and the ability to add
models to already trained layers.

I saw this history: https://github.com/scikit-learn/scikit-learn/pull/8960

This library is very close:
* wolpert: the API is somewhat awkward, but otherwise good. It does not cache
intermediate steps. https://wolpert.readthedocs.io/en/latest/index.html

These solutions seem to allow only two layers:
* https://github.com/scikit-learn/scikit-learn/issues/4816#issuecomment-217817717
* https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/
* https://github.com/scikit-learn/scikit-learn/pull/6674

The people who put these other libraries together have made a very welcome
effort to meet a real need, and it would be great to see that effort pay off
with stacking added to scikit-learn's core library.

As another data point, I attached a simple implementation I put together to
illustrate what I think are core needs of this feature. Feel free to browse
the code. Here is the short list:
* An arbitrary number of layers (or at least 3 ;) )
* Choice of cross-validation (CV) or out-of-bag (OOB) predictions for each
model
* Ability to add a new model to a layer after the stacked ensemble has been
trained, then refit the pipeline so that only the models that must be
retrained are retrained (i.e. train the added model and retrain all models in
higher layers)
* All standard scikit-learn pipeline goodness (introspection, grid search,
serializability, etc.)
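
To make the "retrain only what must be retrained" item concrete, here is a
minimal sketch of just the bookkeeping. It is not the code from the attached
notebook and not a scikit-learn API: the class name LayeredStack, its methods,
and the model names in the usage lines are all hypothetical, and the actual
fitting (on cached CV or OOB predictions from the layer below) is left as a
comment.

```python
from typing import List, Set, Tuple

Entry = Tuple[int, str]  # (layer index, model name); names are hypothetical


class LayeredStack:
    """Tracks which models in a layered stack have an up-to-date fit."""

    def __init__(self, layers: List[List[str]]) -> None:
        # layers[0] is the bottom layer; each layer is a list of model names.
        self.layers: List[List[str]] = [list(names) for names in layers]
        self.fitted: Set[Entry] = set()

    def add_model(self, layer: int, name: str) -> None:
        """Add a model to a layer and invalidate every higher layer."""
        self.layers[layer].append(name)
        # Models above `layer` were trained on the old set of inputs, so
        # their fits are stale. Models at or below `layer` are unaffected.
        self.fitted = {(i, n) for (i, n) in self.fitted if i <= layer}

    def fit(self) -> List[Entry]:
        """Fit only stale models, bottom layer first; return what was refit."""
        retrained: List[Entry] = []
        for i, names in enumerate(self.layers):
            for n in names:
                if (i, n) not in self.fitted:
                    # A real implementation would fit the estimator here,
                    # using cached predictions from layer i - 1 as features.
                    self.fitted.add((i, n))
                    retrained.append((i, n))
        return retrained


stack = LayeredStack([["rf", "svm"], ["gbm"], ["logreg"]])
stack.fit()                # first fit trains all four models
stack.add_model(0, "knn")  # new model in the bottom layer
print(stack.fit())         # [(0, 'knn'), (1, 'gbm'), (2, 'logreg')]
```

The existing bottom-layer models ("rf" and "svm") are left alone on the second
fit, which is the behavior the list item asks for.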

Thanks all! This library is making a real difference for good in the lives
of many people.

Jason

On Fri, Sep 28, 2018 at 11:35 AM <scikit-learn-request at python.org> wrote:

> Today's Topics:
>
>    1. Re: [ANN] Scikit-learn 0.20.0 (Sebastian Raschka)
>    2. Re: [ANN] Scikit-learn 0.20.0 (Andreas Mueller)
>    3. Re: [ANN] Scikit-learn 0.20.0 (Andreas Mueller)
>    4. Re: [ANN] Scikit-learn 0.20.0 (Manuel CASTEJÓN LIMAS)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 28 Sep 2018 11:10:50 -0500
> From: Sebastian Raschka <mail at sebastianraschka.com>
> To: Scikit-learn mailing list <scikit-learn at python.org>
> Subject: Re: [scikit-learn] [ANN] Scikit-learn 0.20.0
> Message-ID:
>         <EFC21EB5-CCC4-48CB-AEB5-2C938689EC5E at sebastianraschka.com>
> Content-Type: text/plain;       charset=us-ascii
>
> >
> > > I think model serialization should be a priority.
> >
> > There is also the ONNX specification that is gaining industrial adoption
> > and that already includes open source exporters for several families of
> > scikit-learn models:
> >
> > https://github.com/onnx/onnxmltools
>
>
> Didn't know about that. This is really nice! What do you think about
> referring to it under
> http://scikit-learn.org/stable/modules/model_persistence.html to make
> people aware that this option exists?
> Would be happy to add a PR.
>
> Best,
> Sebastian
>
>
>
> > On Sep 28, 2018, at 9:30 AM, Olivier Grisel <olivier.grisel at ensta.org> wrote:
> >
> >
> > > I think model serialization should be a priority.
> >
> > There is also the ONNX specification that is gaining industrial adoption
> > and that already includes open source exporters for several families of
> > scikit-learn models:
> >
> > https://github.com/onnx/onnxmltools
> >
> > --
> > Olivier
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> ------------------------------
>
> Message: 2
> Date: Fri, 28 Sep 2018 13:38:39 -0400
> From: Andreas Mueller <t3kcit at gmail.com>
> To: scikit-learn at python.org
> Subject: Re: [scikit-learn] [ANN] Scikit-learn 0.20.0
> Message-ID: <96edd381-2352-f183-486a-b86e395a78f6 at gmail.com>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
>
>
> On 09/28/2018 12:10 PM, Sebastian Raschka wrote:
> >>> I think model serialization should be a priority.
> >> There is also the ONNX specification that is gaining industrial
> >> adoption and that already includes open source exporters for several
> >> families of scikit-learn models:
> >>
> >> https://github.com/onnx/onnxmltools
> >
> > Didn't know about that. This is really nice! What do you think about
> > referring to it under
> > http://scikit-learn.org/stable/modules/model_persistence.html to make
> > people aware that this option exists?
> > Would be happy to add a PR.
> >
> >
> I don't think an open source runtime has been announced yet (or they
> didn't email me like they promised lol).
> I'm quite excited about this as well.
>
> Javier:
> The problem is not so much storing the "model" but storing how to make
> predictions. Different versions could act differently on the same data
> structure, and the data structure could change. Both happen in
> scikit-learn.
> So if you want to make sure the right thing happens across versions, you
> either need to provide serialization and deserialization for every
> version and conversion between those, or you need to provide a way to
> store the prediction function, which basically means you need a
> Turing-complete language (that's what ONNX does).
>
> We basically said doing the first is not feasible within scikit-learn
> given our current amount of resources, and no one has even tried doing it
> outside of scikit-learn (which would be possible).
> Implementing a complete prediction serialization language (the second
> option) is definitely outside the scope of sklearn.
>
>
>
>
> ------------------------------
>
> Message: 3
> Date: Fri, 28 Sep 2018 13:41:13 -0400
> From: Andreas Mueller <t3kcit at gmail.com>
> To: scikit-learn at python.org
> Subject: Re: [scikit-learn] [ANN] Scikit-learn 0.20.0
> Message-ID: <4cfbb327-7489-70ff-8fa3-a21079ec0068 at gmail.com>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
>
>
> Maybe we should add to the FAQ why serialization is hard?
>
>
> ------------------------------
>
> Message: 4
> Date: Fri, 28 Sep 2018 20:34:43 +0200
> From: Manuel CASTEJÓN LIMAS <mcasl at unileon.es>
> To: Scikit-learn user and developer mailing list
>         <scikit-learn at python.org>
> Subject: Re: [scikit-learn] [ANN] Scikit-learn 0.20.0
> Message-ID:
>         <CAAQ3=UFntYo02YkR9YwrCjicb8A3cutpN47L4MYZWxeNNYP+1A at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> How about a Docker-based approach? Just thinking out loud.
> Best,
> Manuel
>
>
>
> ------------------------------
>
> End of scikit-learn Digest, Vol 30, Issue 25
> ********************************************
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Stacking (2).ipynb
Type: application/octet-stream
Size: 28943 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20181001/6ae0749b/attachment-0001.obj>