[scikit-learn] How to not recalculate transformer in a Pipeline?

Andreas Mueller t3kcit at gmail.com
Mon Nov 28 13:46:08 EST 2016



On 11/28/2016 12:15 PM, Gael Varoquaux wrote:
>> Or would you cache the return of "fit" as well as "transform"?
> Caching fit rather than transform. Fit is usually the costly step.
>
>> Caching "fit" with joblib seems non-trivial.
> Why? Caching a function that takes the estimator, X, and y should do
> it. The transformer would clone the estimator on fit, to avoid
> side effects that would trigger recomputes.
I guess so. You'd handle parameters using an estimator_params dict in
__init__ and pass that to the caching function?
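Something like this minimal sketch would do it, I think
(CachedFitTransformer and the "cachedir" location are made-up names,
and the get_params/set_params plumbing for the wrapped estimator is
left out):

from joblib import Memory
from sklearn.base import BaseEstimator, TransformerMixin, clone

memory = Memory("cachedir", verbose=0)  # hypothetical on-disk cache location

@memory.cache
def _cached_fit(estimator, X, y):
    # Fit a clone so the wrapped estimator is never mutated in place;
    # mutating it would change the cache key and force recomputes.
    estimator = clone(estimator)
    estimator.fit(X, y)
    return estimator

class CachedFitTransformer(BaseEstimator, TransformerMixin):
    # Wraps a transformer and memoizes its (costly) fit step.
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = _cached_fit(self.estimator, X, y)
        return self

    def transform(self, X):
        return self.estimator_.transform(X)

Dropped into a Pipeline it would look like:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("pca", CachedFitTransformer(PCA(n_components=2))),
    ("clf", LogisticRegression()),
])

joblib hashes the pickled (unfitted) estimator together with X and y,
so refitting with identical inputs is served from the cache.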
>
> It's a pattern that I use often, I've just never coded a good transformer
> for it.
>
> In my use cases, it works very well, provided that everything is nicely
> seeded. Also, the persistence across sessions is a real time saver.
Yeah for sure :)
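The persistence falls out of the on-disk store: a new Python session
that makes the same call reloads the result instead of recomputing it.
A toy illustration ("cachedir" is again a made-up location):

import time
from joblib import Memory

memory = Memory("cachedir", verbose=1)  # on-disk store survives the interpreter

@memory.cache
def slow_fit(n):
    time.sleep(5)  # stand-in for an expensive fit
    return n * 2

slow_fit(10)  # computed and written to cachedir on the first call
slow_fit(10)  # loaded from disk here, and in any later session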

