From mhfh.kvd5 at gmail.com Wed Feb 1 08:32:03 2023
From: mhfh.kvd5 at gmail.com (m m)
Date: Wed, 1 Feb 2023 14:32:03 +0100
Subject: [scikit-learn] mutual information for continuous variables with scikit-learn
Message-ID:

Hello,

I have two continuous variables (heart rate samples over a period of time) and would like to compute the mutual information between them as a measure of similarity.

I've read some posts suggesting the use of mutual_info_score from scikit-learn, but will this work for continuous variables? One Stack Overflow answer suggested binning the data with np.histogram2d() and passing the resulting contingency table to mutual_info_score:

import numpy as np
from sklearn.metrics import mutual_info_score

def calc_MI(x, y, bins):
    # bin the two samples and compute MI from the contingency table of counts
    c_xy = np.histogram2d(x, y, bins)[0]
    mi = mutual_info_score(None, None, contingency=c_xy)
    return mi

# generate correlated data
L = np.linalg.cholesky([[1.0, 0.60], [0.60, 1.0]])
uncorrelated = np.random.standard_normal((2, 300))
correlated = np.dot(L, uncorrelated)
A = correlated[0]
B = correlated[1]
x = (A - np.mean(A)) / np.std(A)
y = (B - np.mean(B)) / np.std(B)

# calculate MI
mi = calc_MI(x, y, 50)

Is calc_MI a valid approach? I'm asking because I've also read that when the variables are continuous, the sums in the discrete formula become integrals, and I'm not sure whether that procedure is implemented in scikit-learn.

Thanks!

From solegalli at protonmail.com Wed Feb 1 09:14:55 2023
From: solegalli at protonmail.com (Sole Galli)
Date: Wed, 01 Feb 2023 14:14:55 +0000
Subject: [scikit-learn] mutual information for continuous variables with scikit-learn
In-Reply-To:
References:
Message-ID:

Hey,

My understanding is that with sklearn you can compare two continuous variables like this:

mutual_info_regression(data["var1"].to_frame(), data["var"], discrete_features=[False])

where var1 and var are continuous.

You can also compare multiple continuous variables against one continuous variable like this:

mutual_info_regression(data[["var1", "var_2", "var_3"]], data["var"], discrete_features=[False, False, False])

I understand scikit-learn uses nonparametric methods based on entropy estimation from k-nearest neighbors, following the nearest-neighbor approach to MI estimation of Ross, 2014, PLoS ONE 9(2): e87357.

More details here:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html

And I've got a blog post about mutual information with Python here:
https://www.blog.trainindata.com/mutual-information-with-python/

Cheers
Sole

Soledad Galli
https://www.trainindata.com/

Sent with Proton Mail secure email.
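A minimal, self-contained sketch of the calls above (the DataFrame data and the columns var1 and var are made-up illustrative data; any pandas DataFrame with continuous columns would do):

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
data = pd.DataFrame({"var1": rng.standard_normal(300)})
data["var"] = 0.6 * data["var1"] + 0.8 * rng.standard_normal(300)

# X must be 2-D (n_samples, n_features) and y 1-D; the single feature in X
# is continuous, hence discrete_features=[False].
mi = mutual_info_regression(data["var1"].to_frame(), data["var"],
                            discrete_features=[False], random_state=0)
print(mi)  # one non-negative MI estimate per feature in X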
From gael.varoquaux at normalesup.org Wed Feb 1 09:18:40 2023
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 1 Feb 2023 15:18:40 +0100
Subject: [scikit-learn] mutual information for continuous variables with scikit-learn
In-Reply-To:
References:
Message-ID: <20230201141840.qvabkmunulbmll7w@gaellaptop>

For estimating mutual information on continuous variables, have a look at the corresponding package:
https://pypi.org/project/mutual-info/

G

--
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux

From mhfh.kvd5 at gmail.com Wed Feb 1 11:04:05 2023
From: mhfh.kvd5 at gmail.com (m m)
Date: Wed, 1 Feb 2023 17:04:05 +0100
Subject: [scikit-learn] mutual information for continuous variables with scikit-learn
In-Reply-To: <20230201141840.qvabkmunulbmll7w@gaellaptop>
References: <20230201141840.qvabkmunulbmll7w@gaellaptop>
Message-ID:

Thanks Sole and Gael, I'll try both ways. Are the two methods fundamentally different, or will they give me similar results?

Also, most of the MI analyses I've seen with continuous variables discretize the data into arbitrary bins. Is this procedure actually valid? I'd think that by discretizing continuous data we lose important variation in the data.
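A small, self-contained sketch of that binning concern (synthetic data only; exact numbers depend on the random draw). For a bivariate normal with correlation 0.6 the true MI is about 0.22 nat; at a fixed sample size the histogram-based estimate tends to inflate as the arbitrary bin count grows, while the k-nearest-neighbor estimator behind mutual_info_regression needs no binning:

import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.feature_selection import mutual_info_regression

def calc_MI(x, y, bins):
    # histogram (plug-in) MI estimate from a 2-D contingency table of counts
    c_xy = np.histogram2d(x, y, bins)[0]
    return mutual_info_score(None, None, contingency=c_xy)

# 300 samples of two correlated standard normals
rng = np.random.RandomState(0)
L = np.linalg.cholesky([[1.0, 0.60], [0.60, 1.0]])
x, y = np.dot(L, rng.standard_normal((2, 300)))

for bins in (5, 20, 50, 200):
    print(bins, calc_MI(x, y, bins))   # estimate tends to inflate with bins

# k-NN based estimate, no binning needed
print(mutual_info_regression(x.reshape(-1, 1), y, random_state=0))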
From guetlein at posteo.de Tue Feb 21 09:48:40 2023
From: guetlein at posteo.de (Martin Gütlein)
Date: Tue, 21 Feb 2023 14:48:40 +0000
Subject: [scikit-learn] classification model that can handle missing values w/o learning from missing values
Message-ID:

Hi,

I am looking for a classification model in Python that can handle missing values without imputation and "without learning from missing values", i.e. without using the fact that the information is missing for the inference.

Explained with the help of decision trees:

* The algorithm should NOT learn whether missing values should go to the left or right child (like the HistGradientBoostingClassifier does).
* Instead it could build the prediction for each child node and aggregate these (like some Random Forest implementations do).

If that is not possible in scikit-learn, maybe you have already discussed this? Or do you know of a fork of scikit-learn that can do this, or some other Python library?

Any help would be really appreciated,

kind regards,
Martin

P.S. Here is my use case, in case you are interested: I have a binary classification problem with a positive and a negative class, and two types of features, A and B. In my training data, B is missing in most (90%) of the examples. In my test data, B is always present, which is good because the B features are better than the A features.
In the cases where B is present in the training data, the proportion of positive examples is much higher than when it is missing. HistGradientBoostingClassifier therefore learns from the missingness of B, and because B is never missing in the test data it predicts far too many positives. (Additionally, some feature values of type A are also often missing.)
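A small synthetic sketch of the behaviour described in the P.S. (all data, rates and feature names below are made up for illustration): B is observed mostly for positives during training, so the model can learn from the missingness itself, and on a test set where B is always present the predicted positive rate typically ends up well above the true rate.

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.RandomState(0)
n = 5000
y = rng.binomial(1, 0.3, size=n)            # ~30% positives overall
A = rng.normal(size=n) + 0.3 * y            # weak feature, always present
B = rng.normal(size=n) + 2.0 * y            # strong feature
# In training, B is observed far more often when y == 1
keep_B = rng.uniform(size=n) < np.where(y == 1, 0.6, 0.05)
B_train = np.where(keep_B, B, np.nan)       # NaN marks "missing"

clf = HistGradientBoostingClassifier(random_state=0)
clf.fit(np.column_stack([A, B_train]), y)   # NaNs handled natively

# Test data drawn from the same distribution, but B is never missing
y_test = rng.binomial(1, 0.3, size=n)
A_test = rng.normal(size=n) + 0.3 * y_test
B_test = rng.normal(size=n) + 2.0 * y_test
pred = clf.predict(np.column_stack([A_test, B_test]))

print("positive rate in test labels:", y_test.mean())
print("predicted positive rate:     ", pred.mean())  # typically much higher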