[scikit-learn] Announcing skrub: Prepping tables for machine learning

Gael Varoquaux gael.varoquaux at normalesup.org
Tue Dec 19 03:10:13 EST 2023

Thanks for trying and for the feedback, Fernando!

I had not tried it on the Ames housing dataset. I just had a quick look at it, and I think that with the recent improvements in scikit-learn (namely native support for categorical columns), we will be able to get even better behavior out of the box.



On Mon, Dec 18, 2023 at 09:18:03PM -0300, Fernando Marcos Wittmann wrote:
> Very strong baseline indeed. Did a quick check with the Ames housing dataset:
> https://colab.research.google.com/drive/1RVVl_R5X3YYC7kj-B9uI5Fq7-SCYhYnD?usp=sharing

> Thanks all for the contribution! 

> On Mon, Dec 18, 2023 at 2:49 PM Gael Varoquaux <gael.varoquaux at normalesup.org>
> wrote:

>     Hi everyone,

>     We are very happy to announce the first release of a new package called
>     "skrub". Its goal is to facilitate data preparation from tables to machine
>     learning with an API similar to that of scikit-learn.
>     https://skrub-data.org

>     The most useful tool in the short term is the "TableVectorizer", which
>     applies a bunch of heuristics to turn a complex table into a good data
>     representation for learning (for instance encoding dates, or strings).
>     Combined with scikit-learn HistGradientBoosting, it gives a strong baseline
>     for most tabular learning settings without data massaging:

>     from sklearn.ensemble import HistGradientBoostingRegressor
>     from sklearn.pipeline import make_pipeline
>     from skrub import TableVectorizer

>     pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
>     pipeline.fit(X, y)

>     In the longer term, skrub will enable assembling full data processing
>     pipelines across multiple tables that can be cross-validated with
>     scikit-learn and one day put in production: joining, aggregation, and
>     transformation to build models directly from the original tables and
>     databases.

>     One example of such a pipeline can be seen here:
>     https://skrub-data.org/stable/auto_examples/08_join_aggregation.html#chaining-everything-together-in-a-pipeline

>     But there is a lot that remains to be done, and the questions are quite
>     open.

>     In my eyes, the dream is to bridge scikit-learn's API, which separates
>     fit/transform (because it helps make robust and valid predictive
>     pipelines), with dataframe/database operations. The goal is not to provide
>     something as flexible as SQL or pandas, but to cover the most frequent use
>     cases in machine learning, as explained here:
>     https://skrub-data.org/stable/vision.html
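
To make the fit/transform separation concrete, here is a minimal sketch in plain Python. MeanCenterer is a hypothetical class written only for this illustration. It is not part of skrub or scikit-learn, though it follows the same convention: fit learns a statistic from the training data, and transform applies it without re-estimating anything.

```python
class MeanCenterer:
    """Hypothetical transformer, written only to illustrate fit/transform."""

    def fit(self, values):
        # Learn the statistic from the training data only.
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        # Apply the learned statistic without re-estimating it.
        return [v - self.mean_ for v in values]


train = [1.0, 2.0, 3.0]
test = [10.0, 20.0]

centerer = MeanCenterer().fit(train)
train_centered = centerer.transform(train)  # centered with the training mean
test_centered = centerer.transform(test)    # same training mean, reused as-is
```

Because the test values are centered with the mean learned on the training values, no information from held-out data leaks into the preprocessing. That property is what makes cross-validation of a whole pipeline meaningful, and it is the part that plain dataframe operations do not give by default.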

>     Of course, skrub will be developed in the open, with an eye to quality,
>     staying as lightweight as possible while still providing powerful tools. I
>     hope that many will join this adventure!

>     Cheers,

>     Gaël

>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org
>     https://mail.python.org/mailman/listinfo/scikit-learn


    Gael Varoquaux
    Research Director, INRIA
    http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux
