From mhfh.kvd5 at gmail.com Wed Feb 1 08:32:03 2023
From: mhfh.kvd5 at gmail.com (m m)
Date: Wed, 1 Feb 2023 14:32:03 +0100
Subject: [scikit-learn] mutual information for continuous variables with scikit-learn
Message-ID:

Hello,

I have two continuous variables (heart rate samples over a period of time) and would like to compute the mutual information between them as a measure of similarity.

I've read some posts suggesting the use of mutual_info_score from scikit-learn, but will this work for continuous variables? One Stack Overflow answer suggested binning the data with np.histogram2d() and passing the resulting contingency table to mutual_info_score:

import numpy as np
from sklearn.metrics import mutual_info_score

def calc_MI(x, y, bins):
    # bin the two samples and compute MI from the contingency table of counts
    c_xy = np.histogram2d(x, y, bins)[0]
    mi = mutual_info_score(None, None, contingency=c_xy)
    return mi

# generate correlated data
L = np.linalg.cholesky([[1.0, 0.60], [0.60, 1.0]])
uncorrelated = np.random.standard_normal((2, 300))
correlated = np.dot(L, uncorrelated)
A = correlated[0]
B = correlated[1]
x = (A - np.mean(A)) / np.std(A)
y = (B - np.mean(B)) / np.std(B)

# calculate MI
mi = calc_MI(x, y, 50)

Is calc_MI a valid approach? I'm asking because I've also read that when the variables are continuous, the sums in the discrete formula become integrals, and I'm not sure whether that procedure is implemented in scikit-learn.

Thanks!

From solegalli at protonmail.com Wed Feb 1 09:14:55 2023
From: solegalli at protonmail.com (Sole Galli)
Date: Wed, 01 Feb 2023 14:14:55 +0000
Subject: [scikit-learn] mutual information for continuous variables with scikit-learn
In-Reply-To:
References:
Message-ID:

Hey,

My understanding is that with sklearn you can compare two continuous variables like this:

mutual_info_regression(data["var1"].to_frame(), data["var"], discrete_features=[False])

where var1 and var are continuous.

You can also compare multiple continuous variables against one continuous variable like this:

mutual_info_regression(data[["var1", "var_2", "var_3"]], data["var"], discrete_features=[False, False, False])

I understand scikit-learn uses nonparametric methods based on entropy estimation from k-nearest neighbors, following the nearest-neighbor approach to MI estimation of Ross, 2014, PLoS ONE 9(2): e87357.

More details here:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html

And I've got a blog post about mutual information with Python here:
https://www.blog.trainindata.com/mutual-information-with-python/

Cheers
Sole

Soledad Galli
https://www.trainindata.com/

Sent with Proton Mail secure email.
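A minimal, self-contained sketch of the calls above (the DataFrame data and the columns var1 and var are made-up illustrative data; any pandas DataFrame with continuous columns would do):

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
data = pd.DataFrame({"var1": rng.standard_normal(300)})
data["var"] = 0.6 * data["var1"] + 0.8 * rng.standard_normal(300)

# X must be 2-D (n_samples, n_features) and y 1-D; the single feature in X
# is continuous, hence discrete_features=[False].
mi = mutual_info_regression(data["var1"].to_frame(), data["var"],
                            discrete_features=[False], random_state=0)
print(mi)  # one non-negative MI estimate per feature in X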
From gael.varoquaux at normalesup.org Wed Feb 1 09:18:40 2023
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 1 Feb 2023 15:18:40 +0100
Subject: [scikit-learn] mutual information for continuous variables with scikit-learn
In-Reply-To:
References:
Message-ID: <20230201141840.qvabkmunulbmll7w@gaellaptop>

For estimating mutual information on continuous variables, have a look at the corresponding package:
https://pypi.org/project/mutual-info/

G

--
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux

From mhfh.kvd5 at gmail.com Wed Feb 1 11:04:05 2023
From: mhfh.kvd5 at gmail.com (m m)
Date: Wed, 1 Feb 2023 17:04:05 +0100
Subject: [scikit-learn] mutual information for continuous variables with scikit-learn
In-Reply-To: <20230201141840.qvabkmunulbmll7w@gaellaptop>
References: <20230201141840.qvabkmunulbmll7w@gaellaptop>
Message-ID:

Thanks Sole and Gael, I'll try both ways. Are the two methods fundamentally different, or will they give me similar results?

Also, most of the MI analyses I've seen with continuous variables discretize the data into arbitrary bins. Is this procedure actually valid? I'd think that by discretizing continuous data we lose important variation in the data.
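A small, self-contained sketch of that binning concern (synthetic data only; exact numbers depend on the random draw). For a bivariate normal with correlation 0.6 the true MI is about 0.22 nat; at a fixed sample size the histogram-based estimate tends to inflate as the arbitrary bin count grows, while the k-nearest-neighbor estimator behind mutual_info_regression needs no binning:

import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.feature_selection import mutual_info_regression

def calc_MI(x, y, bins):
    # histogram (plug-in) MI estimate from a 2-D contingency table of counts
    c_xy = np.histogram2d(x, y, bins)[0]
    return mutual_info_score(None, None, contingency=c_xy)

# 300 samples of two correlated standard normals
rng = np.random.RandomState(0)
L = np.linalg.cholesky([[1.0, 0.60], [0.60, 1.0]])
x, y = np.dot(L, rng.standard_normal((2, 300)))

for bins in (5, 20, 50, 200):
    print(bins, calc_MI(x, y, bins))   # estimate tends to inflate with bins

# k-NN based estimate, no binning needed
print(mutual_info_regression(x.reshape(-1, 1), y, random_state=0))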
From guetlein at posteo.de Tue Feb 21 09:48:40 2023
From: guetlein at posteo.de (Martin Gütlein)
Date: Tue, 21 Feb 2023 14:48:40 +0000
Subject: [scikit-learn] classification model that can handle missing values w/o learning from missing values
Message-ID:

Hi,

I am looking for a classification model in Python that can handle missing values without imputation and "without learning from missing values", i.e. without using the fact that the information is missing for the inference.

Explained with the help of decision trees:

* The algorithm should NOT learn whether missing values should go to the left or right child (like the HistGradientBoostingClassifier does).
* Instead it could build the prediction for each child node and aggregate these (like some Random Forest implementations do).

If that is not possible in scikit-learn, maybe you have already discussed this? Or do you know of a fork of scikit-learn that can do this, or some other Python library?

Any help would be really appreciated,

kind regards,
Martin

P.S. Here is my use case, in case you are interested: I have a binary classification problem with a positive and a negative class, and two types of features, A and B. In my training data, B is missing in most (90%) of the examples. In my test data, B is always present, which is good because the B features are better than the A features.
In the cases where B is present in the training data, the proportion of positive examples is much higher than when it is missing. HistGradientBoostingClassifier therefore learns from the missingness of B, and because B is never missing in the test data it predicts far too many positives. (Additionally, some feature values of type A are also often missing.)
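A small synthetic sketch of the behaviour described in the P.S. (all data, rates and feature names below are made up for illustration): B is observed mostly for positives during training, so the model can learn from the missingness itself, and on a test set where B is always present the predicted positive rate typically ends up well above the true rate.

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.RandomState(0)
n = 5000
y = rng.binomial(1, 0.3, size=n)            # ~30% positives overall
A = rng.normal(size=n) + 0.3 * y            # weak feature, always present
B = rng.normal(size=n) + 2.0 * y            # strong feature
# In training, B is observed far more often when y == 1
keep_B = rng.uniform(size=n) < np.where(y == 1, 0.6, 0.05)
B_train = np.where(keep_B, B, np.nan)       # NaN marks "missing"

clf = HistGradientBoostingClassifier(random_state=0)
clf.fit(np.column_stack([A, B_train]), y)   # NaNs handled natively

# Test data drawn from the same distribution, but B is never missing
y_test = rng.binomial(1, 0.3, size=n)
A_test = rng.normal(size=n) + 0.3 * y_test
B_test = rng.normal(size=n) + 2.0 * y_test
pred = clf.predict(np.column_stack([A_test, B_test]))

print("positive rate in test labels:", y_test.mean())
print("predicted positive rate:     ", pred.mean())  # typically much higher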