[scikit-learn] ANN Dirty_cat: learning on dirty categories

Tue Nov 20 16:06:30 EST 2018

I would love to see the TargetEncoder ported to scikit-learn.
The CountFeaturizer is pretty stalled:
https://github.com/scikit-learn/scikit-learn/pull/9614

:-/

Have you benchmarked the other encoders in the categorical-encoding lib?
I would be really curious to know when and how they help.

On 11/20/18 3:58 PM, Gael Varoquaux wrote:
> Hi scikit-learn friends,
>
> As you might have seen on Twitter, my lab, with a few friends, has
> embarked on research to ease machine learning on "dirty data". We are
> experimenting with new encoding methods for non-curated string categories.
> For this, we are developing a small software project called "dirty_cat":
> https://dirty-cat.github.io/stable/
>
> dirty_cat is a test bed for new ideas on "dirty categories". It is a
> research project, though we still try to do decent software engineering
> :). Rather than contributing to existing codebases (such as the great
> categorical-encoding project in scikit-learn-contrib), we spun it out
> as a separate software project to have the freedom to try out ideas that
> we might give up on after gaining insight.
>
> We hope that it is a useful tool: if you have non-curated string
> categories, please give it a try. Understanding what works and what does
> not is important to know what to consolidate. Hopefully one day we can
> develop a tool that is of wide-enough interest that it can go in
> scikit-learn-contrib, or maybe even scikit-learn.
>
> Also, if you have suggestions of publicly available databases that we
> could try it on, we would love to hear from you.
>
> Cheers,
>
> Gaël
>
> PS: if you want to work on dirty-data problems in Paris as a post-doc or
> an engineer, drop me a line.