[scikit-learn] Request / Proposal: integrating IEEE paper in scikit-learn as "feature_selection.EFS / EFSCV" and cancer_benchmark datasets

Gael Varoquaux gael.varoquaux at normalesup.org
Sun Sep 24 14:39:53 EDT 2023


Dear Dalibor,

As detailed in the FAQ,
https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms
"""
We only consider well-established algorithms for inclusion. A rule of thumb is at least 3 years since publication, 200+ citations, and wide use and usefulness.
"""

These days, I would say the bar is even higher, as we find ourselves prioritizing things such as high-quality documentation and better dataframe support over new algorithms.

Best,

Gaël

On Sun, Sep 24, 2023 at 11:10:23AM +0200, Dalibor Hrg wrote:
> Dear scikit-learn mailing list,

> Similarly to the existing feature_selection.RFE and RFECV, this is a request to
> openly discuss the PROPOSAL and requirements for feature_selection.EFS and/or
> EFSCV, which would stand for "Evolutionary Feature Selection", starting with 8
> algorithms/methods to be used with scikit-learn estimators, as published in
> IEEE (https://arxiv.org/abs/2303.10182) by the paper's authors. They have
> agreed to help integrate it (in cc).

> PROPOSAL
> Implement/integrate the paper https://arxiv.org/abs/2303.10182 into scikit-learn:

> 1) CODE

>   • implementing feature_selection.EFS and/or EFSCV (a space for the
>     evolutionary computing community interested in feature selection); see
>     the API sketch after the tables below

> RFE is:

> feature_selection.RFE          Feature ranking with recursive feature
> (estimator, *[, ...])          elimination.

> feature_selection.RFECV        Recursive feature elimination with
> (estimator, *[, ...])          cross-validation to select features.
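
> For comparison, the existing RFE is used like this (standard scikit-learn
> API; this snippet runs as-is):

> from sklearn.datasets import load_breast_cancer
> from sklearn.feature_selection import RFE
> from sklearn.svm import SVC
>
> X, y = load_breast_cancer(return_X_y=True)
> # Recursively drop the least important features until 5 remain.
> selector = RFE(SVC(kernel="linear"), n_features_to_select=5).fit(X, y)
> print(selector.support_)   # boolean mask of the 5 selected features
> print(selector.ranking_)   # selected features have rank 1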

>  The "EFS" could be:

> feature_selection.EFS   Feature ranking and feature elimination with 8
> (estimator, *[, ...])   different algorithms (SFE, SFE-PSO, etc.); new
>                         algorithms could be added and benchmarked with
>                         evolutionary computing, swarm, genetic, etc.

> feature_selection.EFSCV Feature elimination with cross-validation to select
> (estimator, *[, ...])   features.
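
> A rough sketch of the EFS estimator shell mentioned above (everything here is
> hypothetical and only illustrates fitting scikit-learn's selector API; a real
> implementation would run the paper's 8 search algorithms inside fit):

> import numpy as np
> from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone
> from sklearn.feature_selection import SelectorMixin
>
> class EFS(SelectorMixin, MetaEstimatorMixin, BaseEstimator):
>     """Hypothetical evolutionary feature selector (sketch only)."""
>
>     def __init__(self, estimator, method="SFE", n_iter=100, random_state=None):
>         self.estimator = estimator
>         self.method = method        # one of the paper's 8 algorithms
>         self.n_iter = n_iter
>         self.random_state = random_state
>
>     def fit(self, X, y):
>         # Placeholder: a real implementation would run the chosen
>         # evolutionary search (SFE, SFE-PSO, ...) and keep the best mask.
>         rng = np.random.default_rng(self.random_state)
>         self.support_ = rng.random(X.shape[1]) < 0.5   # dummy mask
>         if not self.support_.any():
>             self.support_[0] = True  # keep at least one feature
>         self.estimator_ = clone(self.estimator).fit(X[:, self.support_], y)
>         return self
>
>     def _get_support_mask(self):
>         return self.support_

> EFSCV would analogously wrap the search in cross-validation, mirroring RFECV.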


> 2) DATASETS & CANCER BENCHMARK

>   • curating and integrating a fetcher for the 40 cancer_benchmark datasets,
>     either directly in scikit-learn or externally pullable and maintained (a
>     space for contributing and expanding high-dimensional datasets on cancer
>     topics); see the sketch after the table below

> fetch_cancer_benchmark Loads the 40 individual cancer-related
> (*[, ...])             high-dimensional datasets for benchmarking feature
>                        selection methods (classification).
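
> A rough sketch of such a fetcher, as mentioned above (the function name, base
> URL, and file layout are all hypothetical; the real data would live on OpenML
> or Zenodo, and scikit-learn-style fetchers cache downloads in a local data
> home):

> from pathlib import Path
> import urllib.request
> import numpy as np
>
> _BASE_URL = "https://example.org/cancer_benchmark/{name}.csv"  # hypothetical
>
> def fetch_cancer_benchmark(name, data_home="~/scikit_learn_data"):
>     """Download and cache one of the 40 benchmark datasets (sketch only)."""
>     path = Path(data_home).expanduser() / "cancer_benchmark" / f"{name}.csv"
>     if not path.exists():
>         path.parent.mkdir(parents=True, exist_ok=True)
>         urllib.request.urlretrieve(_BASE_URL.format(name=name), path)
>     data = np.loadtxt(path, delimiter=",")
>     return data[:, :-1], data[:, -1]  # assumes the label is the last column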


> 3) TUTORIAL / WEBSITE

>   • writing a tutorial to replicate the IEEE paper results with
>     feature_selection.EFS and/or EFSCV on the cancer_benchmark (40 datasets);
>     a minimal version is sketched below
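
> The tutorial's core loop might then be as short as this (again hypothetical,
> reusing the EFS and fetch_cancer_benchmark sketches above; "colon" is a
> made-up dataset name):

> from sklearn.model_selection import cross_val_score
> from sklearn.pipeline import make_pipeline
> from sklearn.svm import SVC
>
> X, y = fetch_cancer_benchmark("colon")
> pipe = make_pipeline(EFS(SVC(kernel="linear"), method="SFE"),
>                      SVC(kernel="linear"))
> print(cross_val_score(pipe, X, y, cv=5).mean())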


> I have identified the IEEE work https://arxiv.org/abs/2303.10182 as an
> interesting novelty for working with high-dimensional datasets, as it reports
> small subsets of predictive features selected with SVM and KNN across 40
> datasets. Replicability under BSD-3 and scikit-learn's quality standards could
> make benchmarking novel feature selection algorithms easier - in my initial
> opinion. Since this is my very first contact with the IEEE paper authors and
> the scikit-learn list, we would welcome some help/guidance on how integration
> could work out, and whether there is any interest along that line at all.

> Kind regards
> Dalibor Hrg
> https://www.linkedin.com/in/daliborhrg/
>     

> On Sat, Sep 23, 2023 at 9:08 AM Alexandre Gramfort
> <alexandre.gramfort at inria.fr> wrote:

>     Dear Dalibor

>     you should discuss this on the main scikit-learn mailing list.

>     https://mail.python.org/mailman/listinfo/scikit-learn

>     Alex

>     On Fri, Sep 22, 2023 at 12:19 PM Dalibor Hrg <dalibor.hrg at gmail.com> wrote:

>         Dear sklearn feature_selection.RFE Team and IEEE Authors (in-cc),

>         This is a request to openly discuss the idea of a potential
>         feature_selection.EFS, which would stand for "Evolutionary Feature
>         Selection" (EFS for short), starting with the 8 algorithms published
>         in IEEE https://arxiv.org/abs/2303.10182 by the authors for
>         high-dimensional datasets. I have identified this work as an
>         interesting novelty for working with high-dimensional datasets,
>         especially in health fields, and it could mean a lot to the ML
>         community and the scikit-learn project - in my initial opinion.

>         A Jupyter notebook and scikit-learn tutorial replicating this IEEE
>         paper/work as feature_selection.EFS with its 8 algorithms could be a
>         near-term goal. Eventually, scikit-learn EFSCV and diverse
>         classification algorithms could be benchmarked for a "joint paper" in
>         JOSS or a health journal.

>         My initial idea (it doesn't need to be this way and is open to
>         discussion) looks like this:
>          
>         RFE has:

>         feature_selection.RFE       Feature ranking with recursive feature
>         (estimator, *[, ...])       elimination.

>         feature_selection.RFECV     Recursive feature elimination with
>         (estimator, *[, ...])       cross-validation to select features.

>          The "EFS" could have:

>         feature_selection.EFS   Feature ranking and feature elimination with
>         (estimator, *[, ...])   8 different algorithms (SFE, SFE-PSO, etc.);
>                                 new algorithms could be added and benchmarked
>                                 with evolutionary computing, swarm, genetic,
>                                 etc.

>         feature_selection.EFSCV Feature elimination with cross-validation to
>         (estimator, *[, ...])   select features.

>         Looking forward to an open discussion, and to hearing whether
>         Evolutionary Feature Selection (EFS) is something for the sklearn
>         project, or perhaps a separate pip-installable package.

>         Kind regards
>         Dalibor Hrg
>         https://www.linkedin.com/in/daliborhrg/

>         On Fri, Sep 22, 2023 at 10:50 AM Behrooz Ahadzade
>         <b.ahadzade at yahoo.com> wrote:



>             Dear Dalibor Hrg,

>             Thank you very much for your attention to the SFE algorithm, and
>             for the time you took to guide me and my colleagues. Following
>             your guidance, we will add this algorithm to the scikit-learn
>             library as soon as possible.

>             Kind regards,
>             Ahadzadeh.
>             On Wednesday, September 13, 2023 at 12:22:04 AM GMT+3:30, Dalibor
>             Hrg <dalibor.hrg at gmail.com> wrote:


>             Dear Authors,

>             you have done some amazing work on feature selection, published
>             in IEEE: https://arxiv.org/abs/2303.10182 . I have noticed Python
>             code at https://github.com/Ahadzadeh2022/SFE without a LICENSE
>             file or any licensing info, and the paper mentions some links to
>             download the data.

>             I would be interested in working with you to:

>             Step 1) make and release a pip package, publish the code in JOSS
>             (https://joss.readthedocs.io, e.g.
>             https://joss.theoj.org/papers/10.21105/joss.04611) under the
>             BSD-3 license, and replicate the IEEE paper's table results. All
>             8 algorithms could potentially live in one class "EFS", meaning
>             "Evolutionary Feature Selection", selectable as 8 options, among
>             them SFE. Or something like that.
>               
>             Step 2) try to integrate it and work with the scikit-learn
>             people; I would recommend integrating this under
>             https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
>             similarly to sklearn.feature_selection.RFE. I believe this would
>             be a great contribution to the best open library for ML,
>             scikit-learn.

>             I am unsure about the status of the datasets and their licenses.
>             But the datasets could be fetched externally from the OpenML.org
>             repository (see
>             https://scikit-learn.org/stable/datasets/loading_other_datasets.html)
>             or from CERN Zenodo, where the "benchmark datasets" could be
>             expanded; a rough sketch follows below. It depends a bit on the
>             dataset licenses.
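
>             A minimal sketch of the OpenML route using scikit-learn's
>             existing fetch_openml (the dataset name "colon-cancer" is
>             hypothetical; the 40 datasets would first have to be uploaded to
>             OpenML under some agreed naming scheme):

>             from sklearn.datasets import fetch_openml
>
>             # Hypothetical name; as_frame=True returns a pandas DataFrame.
>             data = fetch_openml(name="colon-cancer", version=1, as_frame=True)
>             X, y = data.data, data.target
>             print(X.shape)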

>             Overall, I hope this can greatly increase the visibility of your
>             published work, and also let others credit you in papers in a
>             more citable and replicable way. I believe your IEEE paper and
>             work definitely deserve a spot in scikit-learn. There is a need
>             for replicable code on "Evolutionary Methods for Feature
>             Selection" and for such a benchmark on life-science datasets, and
>             you have done great work so far.

>             Let me know what you think. 

>             Best regards,
>             Dalibor Hrg

>             https://www.linkedin.com/in/daliborhrg/


> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
    Gael Varoquaux
    Research Director, INRIA
    http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux

