[scikit-learn] custom scorer needs group information: how?

Sat May 22 03:02:50 EDT 2021

Dear scikit-learn Community,

I'd like to create a custom scorer to be used with GroupKFold and
GridSearchCV. The issue is that I also need the grouping information
"inside" the custom scorer to compute the desired metric. How can I do that?

Here is a simplified example that explains the issue in detail.
Given this basic and common scenario:
---
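import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

X = <feature values>
y = <labels>
groups = np.array([0,0,1,1,1,1,2,2,3,3,3,...])
parameters = {'n_estimators': [10,100,1000], 'max_depth': [5,10,15]}
gkf = GroupKFold(n_splits=3)
# cv=gkf so that the grid search actually uses the group-aware splitter
clf = GridSearchCV(RandomForestClassifier(), parameters, scoring=my_scorer, cv=gkf)
clf.fit(X, y, groups=groups)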
---
how can I create my_scorer so that it computes, let's say, the "average
accuracy across groups"? That is, my_scorer should know not only y_true
and y_pred but also their grouping structure.
In principle, it should look something like the following snippet, which
needs the group information "for the specific slice of data being evaluated"
(which I call y_groups below). That is the piece of information I don't know
how to propagate there:
---
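from sklearn.metrics import make_scorer

def my_score(y_true, y_pred, y_groups):
  # Accuracy within each group, then averaged over the groups
  result = []
  for group in np.unique(y_groups):
    idx = y_groups == group
    result.append((y_true[idx] == y_pred[idx]).mean())
  return np.mean(result)

# y_groups is not supplied here: this is the missing piece
my_scorer = make_scorer(my_score)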
---
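To make the intended metric concrete, here is a minimal, self-contained
sketch that calls the scoring function directly on made-up toy arrays (the
data and the variable names group_avg and plain_acc are only illustrative
assumptions). It shows how the "average accuracy across groups" can differ
from plain accuracy when the groups have different sizes:

```python
import numpy as np

def my_score(y_true, y_pred, y_groups):
    """Mean of the per-group accuracies."""
    y_true, y_pred, y_groups = map(np.asarray, (y_true, y_pred, y_groups))
    result = []
    for group in np.unique(y_groups):
        idx = y_groups == group
        result.append((y_true[idx] == y_pred[idx]).mean())
    return float(np.mean(result))

# Toy data, made up for illustration: group 0 has 2 samples, group 1 has 4.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0])
y_groups = np.array([0, 0, 1, 1, 1, 1])

# Per-group accuracies are 0.5 (group 0) and 1.0 (group 1).
group_avg = my_score(y_true, y_pred, y_groups)
# Plain accuracy pools all six samples together instead.
plain_acc = float((y_true == y_pred).mean())
print(group_avg, plain_acc)
```

Here the group average weights each group equally, while plain accuracy
weights each sample equally, which is why the scorer needs y_groups for
exactly the samples being scored.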
How can I make a custom scorer that has access to the group information for
the specific predictions being scored?

Thanks in advance for your help,

Emanuele