[scikit-learn] Domain adaptation and cross-validation

Thu Feb 20 08:30:07 EST 2020

Hello
I am working on a binary classification task using machine learning.
One class C0 is built upon public data called D0. The other class C1 is
made of data named D1 that I generated. I actually generated two
datasets D1a and D1b with slightly different parameters.

I would like to evaluate how well a model adapts across domains in the
context of D0, D1a and D1b. In other words, I would like to train a model
on data from D0 and D1a, and then test its performance on D0 and D1b.
I plan to perform a k-fold cross-validation on D0 and bootstrapping on
D1a and D1b. For example, at each iteration, I would build the training
data from k-1 folds of D0 and bootstrapped data from D1a. The testing data
would be built from the single remaining fold of D0 and bootstrapped data
from D1b.
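
A minimal sketch of one way to build these iterations with existing
tools, assuming the three datasets are kept separate and only row indices
are needed. The function name domain_adaptation_folds and its parameters
are hypothetical; KFold and sklearn.utils.resample are existing
scikit-learn utilities.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import resample


def domain_adaptation_folds(n_d0, n_d1a, n_d1b, n_splits=5, seed=0):
    """Yield (d0_train, d1a_boot, d0_test, d1b_boot) index arrays per fold.

    Hypothetical helper: indices are relative to each dataset, not to a
    concatenated array.
    """
    rng = np.random.RandomState(seed)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    d0_placeholder = np.zeros((n_d0, 1))  # KFold only needs the row count
    for d0_train, d0_test in kfold.split(d0_placeholder):
        # Bootstrap: draw with replacement, same size as the source dataset.
        d1a_boot = resample(np.arange(n_d1a), replace=True, random_state=rng)
        d1b_boot = resample(np.arange(n_d1b), replace=True, random_state=rng)
        yield d0_train, d1a_boot, d0_test, d1b_boot


folds = list(domain_adaptation_folds(n_d0=20, n_d1a=8, n_d1b=6, n_splits=4))
```

Each tuple gives the D0 training folds, the bootstrapped D1a rows, the
held-out D0 fold and the bootstrapped D1b rows for one iteration.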

I was thinking of writing a class similar to those in
sklearn.model_selection (e.g. KFold or StratifiedKFold) to perform the
method described above.
The class's __init__ method would take the KFold/StratifiedKFold and
bootstrapping parameters.
The split function inside this class would be generic enough to handle
many datasets for either D0, or D1a, or D1b. This function's parameters
would be the usual X and y for data and targets, along with information
about the structure of X. This additional information would be
propagated from the fit function in the same way that the optional groups
parameter is passed to group-related split functions in classes such as
GroupKFold.
Here X would be the concatenation of train-test datasets (i.e. D0-like),
train only datasets (i.e. D1a-like) and test only datasets (i.e.
D1b-like). y would be built in a similar way.
The additional parameters may thus be two tuples for the start and end
indexes of train only datasets (i.e. D1a-like) and test only datasets
(i.e. D1b-like). These values would allow the split function to properly
operate on X and y by taking into account boundaries between dataset
types when building folds and performing bootstrapping.

As far as I know, I cannot perform such a procedure with scikit-learn
using the classes in sklearn.model_selection
(https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_split.py).
Did I miss something? Maybe somewhere else in the code?
If there is no implementation in scikit-learn, would you be interested
in a pull-request with such a function?

Best regards,
Johan Mazel