[scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories

Joris Van den Bossche jorisvandenbossche at gmail.com
Thu Dec 13 04:16:28 EST 2018


Hi all,

I finally had some time to start looking at it the last days. Some
preliminary work can be found here:
https://github.com/jorisvandenbossche/target-encoder-benchmarks.

Up to now, I have only done some preliminary work to set up the benchmarks (based
on Patricio Cerda's code, https://arxiv.org/pdf/1806.00979.pdf) and, with
some initial datasets (medical charges and employee salaries), compared the
different implementations with their default settings.
So there is still a lot to do (add datasets, investigate the actual
differences between the different implementations and results, compare the
options in a more structured way, etc.; some to-dos are listed in
the README). However, I will mostly be on holiday for the rest of December.
If somebody wants to look into it further, that is certainly welcome;
otherwise, it will be a priority for me at the beginning of January.
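For readers unfamiliar with what these implementations share, here is a minimal sketch of plain (mean) target encoding, the basic technique being benchmarked. The function name and fallback behaviour are illustrative assumptions, not any particular library's API:

```python
import pandas as pd

def target_encode(train_cats, y, test_cats):
    """Replace each category by the mean of the target y over the
    training rows with that category. Minimal sketch: real
    implementations typically add smoothing toward the global mean."""
    train_cats = pd.Series(train_cats)
    y = pd.Series(y)
    means = y.groupby(train_cats).mean()
    # Categories unseen during training fall back to the global mean.
    return pd.Series(test_cats).map(means).fillna(y.mean()).to_numpy()

enc = target_encode(["a", "a", "b"], [1.0, 3.0, 0.0], ["a", "b", "c"])
# "a" -> mean(1, 3) = 2.0, "b" -> 0.0, unseen "c" -> global mean 4/3
```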

For datasets: additional ideas are welcome. For now, the idea is to add a
subset of the Criteo Terabyte Click dataset, and to generate some data.

>>> Does that mean you'd be opposed to adding the leave-one-out
TargetEncoder
>>> I would really like to add it before February
>> A few months to get it right is not that bad, is it?
> The PR is over a year old already, and you hadn't voiced any opposition
> there.

As far as I understand, the open PR is not a leave-one-out TargetEncoder?
I also did not yet add the CountFeaturizer from that scikit-learn PR,
because it is actually quite different (e.g. it doesn't work for regression
tasks, since it counts conditional on y). But for classification it could
easily be added to the benchmarks.
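To make the leave-one-out variant discussed above concrete, here is a hedged sketch of the idea: each row's category is encoded by the target mean computed over all *other* rows of that category, which reduces target leakage in training. The function name and single-occurrence fallback are my assumptions for illustration, not the PR's actual implementation:

```python
import pandas as pd

def target_encode_loo(categories, y):
    """Leave-one-out target encoding: for each row, use the mean of y
    over the other rows sharing its category, excluding the row itself."""
    df = pd.DataFrame({"cat": categories, "y": y})
    grp = df.groupby("cat")["y"]
    sums = grp.transform("sum")
    counts = grp.transform("count")
    # Subtract the row's own target before averaging; categories seen
    # only once divide by zero, so fall back to the global target mean.
    loo = (sums - df["y"]) / (counts - 1)
    return loo.fillna(df["y"].mean()).to_numpy()

cats = ["a", "a", "a", "b", "b", "c"]
y = [1.0, 2.0, 3.0, 0.0, 1.0, 5.0]
enc = target_encode_loo(cats, y)
# First "a" row: mean of the other "a" targets = (2 + 3) / 2 = 2.5
```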

Joris