[scikit-learn] Announcing skrub: Prepping tables for machine learning

Gael Varoquaux gael.varoquaux at normalesup.org
Tue Dec 19 03:10:13 EST 2023


Thanks for trying and for the feedback, Fernando!

I had not tried it on the Ames housing dataset. I just had a quick look at it, and I think that with the recent improvements in scikit-learn (namely native support for categorical columns), we will be able to get even better behavior out of the box.

Cheers,

Gaël

On Mon, Dec 18, 2023 at 09:18:03PM -0300, Fernando Marcos Wittmann wrote:
> Very strong baseline indeed. Did a quick check with the Ames housing dataset: 
> https://colab.research.google.com/drive/1RVVl_R5X3YYC7kj-B9uI5Fq7-SCYhYnD?usp=sharing

> Thanks all for the contribution! 

> On Mon, Dec 18, 2023 at 2:49 PM Gael Varoquaux <gael.varoquaux at normalesup.org>
> wrote:

>     Hi everyone,

>     We are very happy to announce the first release of a new package called
>     "skrub". Its goal is to facilitate data preparation from tables to machine
>     learning with an API similar to that of scikit-learn.
>     https://skrub-data.org

>     The most useful tool in the short term is the "TableVectorizer", which
>     applies a bunch of heuristics to turn a complex table into a good data
>     representation for learning (for instance encoding dates, or strings).
>     Combined with scikit-learn HistGradientBoosting, it gives a strong baseline
>     for most tabular learning settings without data massaging:

>     from sklearn.ensemble import HistGradientBoostingRegressor
>     from sklearn.pipeline import make_pipeline
>     from skrub import TableVectorizer

>     pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
>     pipeline.fit(X, y)


>     In the longer term, skrub will enable assembling full data processing
>     pipelines across multiple tables that can be cross-validated with
>     scikit-learn and one day put in production: joining, aggregation, and
>     transformation to build models directly from the original tables and
>     databases.

>     One example of such a pipeline can be seen here:
>     https://skrub-data.org/stable/auto_examples/08_join_aggregation.html#chaining-everything-together-in-a-pipeline

>     But there is a lot that remains to be done, and the questions are quite
>     open.

>     In my eyes, the dream is to bridge scikit-learn's API, which separates
>     fit/transform (because it helps make predictive pipelines robust and valid),
>     with dataframe/database operations. The goal is not to provide something as
>     flexible as SQL or pandas, but to cover the most frequent use cases in
>     machine learning, as explained here:
>     https://skrub-data.org/stable/vision.html

>     Of course, skrub will be developed in the open, with an eye to quality,
>     staying as lightweight as possible while still providing powerful tools. I
>     hope that many will join this adventure!

>     Cheers,

>     Gaël

>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org
>     https://mail.python.org/mailman/listinfo/scikit-learn




-- 
    Gael Varoquaux
    Research Director, INRIA
    http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux

