[scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories

Andreas Mueller t3kcit at gmail.com
Fri Dec 14 10:46:10 EST 2018



On 12/13/18 4:16 AM, Joris Van den Bossche wrote:
> Hi all,
>
> I finally had some time to start looking at it over the last few days. 
> Some preliminary work can be found here: 
> https://github.com/jorisvandenbossche/target-encoder-benchmarks.
You continue to be my hero. I probably cannot look at it in detail before 
the holidays, though :-/
>
> Up to now, I have only done some preliminary work to set up the 
> benchmarks (based on Patricio Cerda's code, 
> https://arxiv.org/pdf/1806.00979.pdf) and, with some initial datasets 
> (medical charges and employee salaries), compared the different 
> implementations with their default settings.
> So there is still a lot to do: add datasets, investigate the actual 
> differences between the implementations and their results, compare 
> the options in a more structured way, etc. (some todos are listed in 
> the README). However, I am now mostly on holiday for the rest of 
> December. If somebody wants to look at it further, that is certainly 
> welcome; otherwise, it will be a priority for me at the beginning of 
> January.
>
> For datasets: additional ideas are welcome. For now, the idea is to 
> add a subset of the Criteo Terabyte Click dataset, and to generate 
> some data.
>
> >>> Does that mean you'd be opposed to adding the leave-one-out TargetEncoder?
> >>> I would really like to add it before February
> >> A few months to get it right is not that bad, is it?
> > The PR is over a year old already, and you hadn't voiced any opposition
> > there.
>
> As far as I understand, the open PR is not a leave-one-out TargetEncoder?
I would want it to be :-/
> I have also not yet added the CountFeaturizer from that scikit-learn PR, 
> because it is actually quite different (e.g., it doesn't work for 
> regression tasks, as it counts conditional on y). But for 
> classification it could easily be added to the benchmarks.
I'm confused now. That's what TargetEncoder and leave-one-out 
TargetEncoder do as well, right?
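
To make the comparison concrete, here is a minimal sketch of the three ideas in plain Python. The function names are hypothetical and are not taken from the open PR, dirty_cat, or the benchmark repository. It assumes that "counts conditional on y" means per-category counts of each class label, and that a category seen only once falls back to the global mean in the leave-one-out case.

```python
from collections import defaultdict


def target_encode(categories, y):
    """Replace each category by the mean of y over all rows in that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, y):
        sums[c] += t
        counts[c] += 1
    return [sums[c] / counts[c] for c in categories]


def leave_one_out_target_encode(categories, y):
    """Same as above, but each row's own target is left out of its mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, y):
        sums[c] += t
        counts[c] += 1
    # Assumption: singleton categories fall back to the global mean of y.
    prior = sum(y) / len(y)
    return [
        (sums[c] - t) / (counts[c] - 1) if counts[c] > 1 else prior
        for c, t in zip(categories, y)
    ]


def conditional_counts(categories, y):
    """For each row, count how often each class label occurs in its category."""
    table = defaultdict(lambda: defaultdict(int))
    for c, t in zip(categories, y):
        table[c][t] += 1
    labels = sorted(set(y))  # one output column per distinct label
    return [[table[c][label] for label in labels] for c in categories]


cats = ["a", "a", "a", "b", "b", "c"]
y = [1, 0, 1, 0, 0, 1]

print(target_encode(cats, y))
print(leave_one_out_target_encode(cats, y))
print(conditional_counts(cats, y))
```

Under this reading, all three condition on y, but the two mean-based encoders also accept a continuous y, whereas per-label counts only make sense when y takes a small set of discrete values.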
