[scikit-learn] custom scorer needs group information: how?
Emanuele Olivetti
emanuele.olivetti at gmail.com
Sat May 22 03:02:50 EDT 2021
Dear scikit-learn Community,
I'd like to create a custom scorer to be used with GroupKFold and
GridSearchCV. The issue is that I also need the grouping information
"inside" the custom scorer in order to compute the desired metric. How can
I do that?
Here is a simplified example that explains the issue in detail.
Given this basic and common scenario:
---
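import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

X = <feature values>
y = <labels>
groups = np.array([0,0,1,1,1,1,2,2,3,3,3,...])
parameters = {'n_estimators': [10,100,1000], 'max_depth': [5,10,15]}
gkf = GroupKFold(n_splits=3)
# pass the group-aware splitter via cv; the groups themselves are given at fit time
clf = GridSearchCV(RandomForestClassifier(), parameters, scoring=my_scorer, cv=gkf)
clf.fit(X, y, groups=groups)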
---
how can I create my_scorer so that it computes, say, the "average accuracy
across groups"? That is, my_scorer should know not only y_true and y_pred
but also their grouping structure.
In principle, it should be something like the following snippet, which
needs the group information "for the specific slice of data evaluated"
(which I call y_groups below), a piece of information that I don't know
how to propagate there:
---
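import numpy as np
from sklearn.metrics import make_scorer

def my_score(y_true, y_pred, y_groups):
    result = []  # one accuracy value per group
    for group in np.unique(y_groups):
        idx = y_groups == group  # boolean mask selecting this group's samples
        result.append((y_true[idx] == y_pred[idx]).mean())
    return np.mean(result)

my_scorer = make_scorer(my_score)  # y_groups is not supplied here: this is the open question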
---
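To make the target metric concrete, here is a small self-contained sketch that evaluates it on toy arrays and compares it with plain accuracy. The function name group_mean_accuracy and all the data values are invented for illustration; only the metric definition comes from the snippet above.

```python
import numpy as np

def group_mean_accuracy(y_true, y_pred, y_groups):
    """Mean of the per-group accuracies (hypothetical helper name)."""
    y_true, y_pred, y_groups = map(np.asarray, (y_true, y_pred, y_groups))
    per_group = [
        (y_true[y_groups == g] == y_pred[y_groups == g]).mean()
        for g in np.unique(y_groups)
    ]
    return float(np.mean(per_group))

# Toy data, invented for illustration only.
y_true = np.array([1, 1, 0, 0, 0, 0, 1, 1])
y_pred = np.array([1, 1, 0, 1, 1, 1, 1, 0])
y_groups = np.array([0, 0, 1, 1, 1, 1, 2, 2])

# Per-group accuracies are 1.0, 0.25 and 0.5, so their mean is 1.75 / 3.
grouped = group_mean_accuracy(y_true, y_pred, y_groups)
# Plain accuracy ignores the groups: 4 correct predictions out of 8.
plain = float((y_true == y_pred).mean())
print(grouped, plain)
```

The two numbers differ because the larger group no longer dominates the average, which is why the scorer needs y_groups and not only y_true and y_pred.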
How can I make a custom scorer that uses, internally, the group information
for the specific predictions being scored?
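One possible workaround, given only as a sketch under stated assumptions and not as the recommended scikit-learn approach: a scorer may also be a plain callable with signature (estimator, X, y), so the group ids can travel with the data as an extra last column of X that the model drops before fitting. The helper names (drop_last_column, group_mean_accuracy, X_with_groups) and the random data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def drop_last_column(X):
    # The last column only carries the group ids; hide it from the classifier.
    return X[:, :-1]

def group_mean_accuracy(y_true, y_pred, y_groups):
    # Mean of the per-group accuracies.
    return float(np.mean([(y_true[y_groups == g] == y_pred[y_groups == g]).mean()
                          for g in np.unique(y_groups)]))

def my_scorer(estimator, X_slice, y_true):
    # Callable scorer: recover the groups of this slice from the last column.
    y_groups = X_slice[:, -1]
    return group_mean_accuracy(np.asarray(y_true), estimator.predict(X_slice), y_groups)

# Synthetic data, invented for illustration only.
rng = np.random.RandomState(0)
features = rng.normal(size=(60, 4))
y = (features[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)
groups = np.repeat(np.arange(6), 10)
X_with_groups = np.column_stack([features, groups])

model = make_pipeline(FunctionTransformer(drop_last_column),
                      RandomForestClassifier(random_state=0))
parameters = {"randomforestclassifier__n_estimators": [5, 10],
              "randomforestclassifier__max_depth": [2, 3]}
search = GridSearchCV(model, parameters, scoring=my_scorer, cv=GroupKFold(n_splits=3))
search.fit(X_with_groups, y, groups=groups)
print(search.best_params_, search.best_score_)
```

The design choice here is that the scorer only ever sees the slice of X being evaluated, so anything it needs about that slice has to be recoverable from X itself.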
Thanks in advance for your help,
Emanuele