[scikit-learn] ANN Dirty_cat: learning on dirty categories

Wed Nov 21 11:35:11 EST 2018


On 11/21/18 10:34 AM, Gael Varoquaux wrote:
> Joris has just agreed to help with benchmarking. We can have
> preliminary results earlier. The question really is: out of the different
> variants that exist, which one should we choose? I think that it is a
> legitimate question that arises on many of our PRs.
Thanks Joris! I could also ask Jan to help ;)
For me, the question for this particular issue is also "what are good
benchmark datasets?"
It's a somewhat different task from what you're benchmarking with
dirty_cat, right?
In dirty_cat you used dirty categories, which are a subset of all
high-cardinality categorical variables.
Whether "clean" high-cardinality variables like zip codes or dirty ones
are the better benchmark is a bit unclear to me, and I'm not aware of a
wealth of datasets for either :-/

>
> But in general, I don't think that we should rush things because of
> deadlines. Consequences of a rush are that we need to change things after
> merge, which is more work. I know that it is slow, but we are quite a
> central package.
I agree.