From nils106 at googlemail.com Tue Mar 3 02:49:51 2020 From: nils106 at googlemail.com (Nils Wagner) Date: Tue, 3 Mar 2020 08:49:51 +0100 Subject: [scikit-learn] tensorflow and scikit-learn Message-ID: Hi All, I am a newbie to scikit-learn. Is it possible to use scikit-learn instead of tensorflow and keras in the attached script? Best regards, Nils -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part --------------

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import random
import math
import numpy as np

np.random.seed(1)

#
# ModuleNotFoundError: No module named 'tensorflow'
#
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

def Amplitude(omega, zeta):
    """Analytic amplitude calculation"""
    A = 1/math.sqrt((1-omega**2)**2+(2*zeta*omega)**2)
    return A

zeta_0 = 0.1     # Damping ratio
w_min = 0.0      # Start frequency
w_max = 10.0     # End frequency
N_omega = 300    # Number of points per interval

w = np.linspace(w_min, w_max, N_omega).reshape(-1, 1)
Amplitude = np.vectorize(Amplitude)
a = Amplitude(w, zeta_0)

rnd_indices = np.random.rand(len(w)) < 0.80
x_train = w[rnd_indices]
y_train = a[rnd_indices]
x_test = w[~rnd_indices]
y_test = a[~rnd_indices]
print(x_train)
print(x_test)
input('Press enter to continue')

# Create a model
def baseline_model():
    height = 100
    model = Sequential()
    model.add(Dense(height, input_dim=1, activation='tanh', kernel_initializer='uniform'))
    model.add(Dense(height, input_dim=height, activation='tanh', kernel_initializer='uniform'))
    model.add(Dense(height, input_dim=height, activation='tanh', kernel_initializer='uniform'))
    model.add(Dense(1, input_dim=height, activation='linear', kernel_initializer='uniform'))
    sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)
    model.compile(loss='mse', optimizer=sgd)
    return model

# Training the model
model = baseline_model()
model.fit(x_train, y_train, epochs=1000, verbose=0)

plt.figure(figsize=(16,8))
plt.rcParams["font.family"] = "arial"
plt.rcParams["font.size"] = "18"
plt.semilogy(x_test, model.predict(x_test), 'og')
plt.semilogy(x_train, model.predict(x_train), 'r')
plt.semilogy(w, a, 'b')
plt.xlabel('Driving Angular Frequency [Hz]')
plt.ylabel('Amplitude [m]')
plt.title('Oscillator Amplitude vs Driving Angular Frequency')
plt.legend(['TensorFlow Test', 'TensorFlow Training', 'Analytic Solution'])
plt.show()

From niourf at gmail.com Tue Mar 3 07:36:41 2020 From: niourf at gmail.com (Nicolas Hug) Date: Tue, 3 Mar 2020 07:36:41 -0500 Subject: [scikit-learn] tensorflow and scikit-learn In-Reply-To: References: Message-ID: Hi Nils, From a quick glance it looks like you're building a fully connected multi-layer perceptron so yes, this is possible in scikit-learn with the neural_network module (check out the docs). The script would be quite different though, it's not just plug and play. Also, for anything more complex in neural nets, we would not recommend scikit-learn. Nicolas On 3/3/20 2:49 AM, Nils Wagner via scikit-learn wrote: > Hi All, > > I am a newbie to scikit-learn. Is it possible to use scikit-learn > instead of tensorflow and keras in the attached script? > > Best regards, > Nils > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL:
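For reference, a minimal sketch of the scikit-learn variant Nicolas points to (the neural_network module). It reuses the x_train / y_train / x_test arrays built in the attached script; the hyperparameters below are only rough analogues of the Keras model (assumed, not an exact translation), so convergence behaviour will differ.

from sklearn.neural_network import MLPRegressor

# Three hidden tanh layers of width 100, trained with SGD + Nesterov
# momentum, loosely mirroring baseline_model() above.
mlp = MLPRegressor(hidden_layer_sizes=(100, 100, 100),
                   activation='tanh',
                   solver='sgd',
                   learning_rate_init=0.01,
                   momentum=0.9,
                   nesterovs_momentum=True,
                   max_iter=1000,
                   random_state=1)
mlp.fit(x_train, y_train.ravel())   # ravel() because y_train has shape (n, 1)
pred_test = mlp.predict(x_test)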
From adrin.jalali at gmail.com Tue Mar 3 08:19:19 2020 From: adrin.jalali at gmail.com (Adrin) Date: Tue, 3 Mar 2020 14:19:19 +0100 Subject: [scikit-learn] tensorflow and scikit-learn In-Reply-To: References: Message-ID: skorch is another nice library to do DL in sklearn based environments/workflows. On Tue., Mar. 3, 2020, 13:37 Nicolas Hug, wrote: > Hi Nils, > > From a quick glance it looks like you're building a fully connected > multi-layer perceptron so yes, this is possible in scikit-learn with the > neural_network module (check out the docs). The script would be quite > different though, it's not just plug and play. Also, for anything more > complex in neural nets, we would not recommend scikit-learn. > > Nicolas > On 3/3/20 2:49 AM, Nils Wagner via scikit-learn wrote: > > Hi All, > > I am a newbie to scikit-learn. Is it possible to use scikit-learn instead of > tensorflow and keras in the attached script? > > Best regards, > Nils > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Mar 3 17:47:46 2020 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 4 Mar 2020 09:47:46 +1100 Subject: [scikit-learn] distances Message-ID: I noticed a comment by @amueller on Gitter re considering a project on our distances implementations. I think there's a lot of work that can be done in unifying distances implementations... (though I'm not always sure the benefit.) I thought I would summarise some of the issues below, as I was unsure what Andy intended. As @jeremiedbb said, making n_jobs more effective would be beneficial. Reducing duplication between metrics.pairwise and neighbors._dist_metrics and kmeans would be noble (especially with regard to parameters, where scipy.spatial's mahalanobis available through sklearn.metrics does not accept V but sklearn.neighbors does), and perhaps offer higher consistency of results and efficiencies. We also have idioms in the code like "if the metric is euclidean, use squared=True where we only need a ranking, then take the square root" while neighbors metrics abstract this with an API by providing rdist and rdist_to_dist. There are issues about making sure that pairwise_distances(metric='minkowski', p=2) is using the same implementation as pairwise_distances(metric='euclidean'), etc. We have issues with chunking and distributing computations in the case that metric params are derived from the dataset (ideally a training set). #16419 is a simple instance where the metric param is sample-aligned and needs to be chunked up. In other cases, we precompute some metric param over all the data, then pass it to each chunk worker, using _precompute_metric_params introduced in #12672. This is also relevant to #9555. While that initial implementation in #12672 is helpful and aims to maintain backwards compatibility, it makes some dubious choices. Firstly in terms of code structure it is not a very modular approach - each metric is handled with an if-then. Secondly, it *only* handles the chunking case, relying on the fact that these metrics are in scipy.spatial, and have a comparable handling of V=None and VI=None.
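As a concrete illustration of that V handling, the data-derived default can be overridden from the caller's side. A hedged sketch against the public pairwise_distances API (assuming, as in recent releases, that the V keyword for metric='seuclidean' is forwarded to scipy; when V is omitted it is computed from X and Y stacked together, as discussed further below):

import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))          # "train"
Y = rng.normal(size=(50, 3)) * 10.0    # "test" with a very different spread

# V derived implicitly from X and Y together
D_implicit = pairwise_distances(X, Y, metric='seuclidean')

# V fixed explicitly from the training data only
V = X.var(axis=0, ddof=1)
D_explicit = pairwise_distances(X, Y, metric='seuclidean', V=V)

print(np.abs(D_implicit - D_explicit).max())   # not zero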
In the Gower Distances PR (#9555) when implementing a metric locally, rather than relying on scipy.spatial, we needed to provide an implementation of these default parameters both when the data is chunked and when the metric function is called straight out. Thirdly, its approach to training vs test data is dubious. We don't formally label X and Y in pairwise_distances as train/test, and perhaps we should. Maintaining backwards compat with scipy's seuclidean and mahalanobis, our implementation stacks X and Y to each other if both are provided, and then calculates their variance. This means that users may be applying a different metric at train and at test time (if the variance of X as train and Y as test is substantially different), which I consider a silent error. We can either make the train/test nature of X and Y more explicit, or we can require that data-based parameters are given explicitly by the user and not implicitly computed. If I understand correctly, sklearn.neighbors will not compute V or VI for you, and it must be provided explicitly. (Requiring that the scaling of each feature be given explicitly in Gower seems like an unnecessary burden on the user, however.) Then there are issues like whether we should consistently set the diagonal to zero in all metrics where Y=None. In short, there are several projects in distances, and I'd support them being considered for work.... But it's a lot of engineering, if motivated by ML needs and consistency for users. J -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeremie.du-boisberranger at inria.fr Wed Mar 4 13:20:43 2020 From: jeremie.du-boisberranger at inria.fr (Jeremie du Boisberranger) Date: Wed, 4 Mar 2020 19:20:43 +0100 Subject: [scikit-learn] ANN: scikit-learn 0.22.2.post1 In-Reply-To: References: Message-ID: This is a minor release including a few bug fixes. Here is the full changelog: https://scikit-learn.org/stable/whats_new/v0.22.html#version-0-22-2 The 0.22.2.post1 release includes a packaging fix for the source distribution but the content of the packages is otherwise identical to the content of the wheels with the 0.22.2 version (without the .post1 suffix). Thank you very much to all who contributed to this release ! Regards, J?r?mie, on behalf of the scikit-learn maintainer team. From rawtevipula25 at gmail.com Thu Mar 5 10:00:44 2020 From: rawtevipula25 at gmail.com (Vipula Rawte) Date: Thu, 5 Mar 2020 10:00:44 -0500 Subject: [scikit-learn] Getting identical mse, r2, mae for different data Message-ID: I am getting identical metric evaluation values for different data, I printed the matrix shape too. Below is a screenshot: [image: image.png] -- Regards, Vipula Rawte -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 126984 bytes Desc: not available URL: From fhjaime96 at gmail.com Thu Mar 5 10:06:44 2020 From: fhjaime96 at gmail.com (Jaime Ferrando Huertas) Date: Thu, 5 Mar 2020 16:06:44 +0100 Subject: [scikit-learn] Getting identical mse, r2, mae for different data In-Reply-To: References: Message-ID: Can you provide the code that produces this output? El jue., 5 mar. 2020 a las 16:03, Vipula Rawte () escribi?: > I am getting identical metric evaluation values for different data, I > printed the matrix shape too. 
> > Below is a screenshot: > > [image: image.png] > > -- > Regards, > Vipula Rawte > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 126984 bytes Desc: not available URL: From rawtevipula25 at gmail.com Thu Mar 5 12:07:27 2020 From: rawtevipula25 at gmail.com (Vipula Rawte) Date: Thu, 5 Mar 2020 12:07:27 -0500 Subject: [scikit-learn] Getting identical mse, r2, mae for different data In-Reply-To: References: Message-ID: import os import sys import csv import pandas as pd from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error from sklearn.feature_extraction.text import TfidfVectorizer from nltk.tokenize import RegexpTokenizer import re import numpy as np from sklearn.svm import SVR import time from scipy.sparse import csr_matrix from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict from sklearn import metrics import copy from multiscorer import MultiScorer start = time.time() #print("metrics: ", metrics.SCORERS.keys()) mae_file = open('mae_scores.txt', 'w') mse_file = open('mse_scores.txt', 'w') r2_file = open('r2_scores.txt', 'w') def tokenizer(text): if text: result = re.findall('[a-z]{2,}', text.lower()) else: result = [] return result def tfidf_vect(X): vect = TfidfVectorizer(tokenizer=tokenizer, stop_words='english') v = vect.fit(X) X_vect = v.transform(X) return X_vect def compute(X_vect,y): scorer = MultiScorer({ 'r2' : (r2_score , {}), 'mse' : (mean_squared_error, {}), 'mae' : (mean_absolute_error, {}) }) #SVR model model = SVR(C=1.0, epsilon=0.2, kernel= "poly") X_train, X_test, y_train, y_test = train_test_split(X_vect, y, test_size=0.33, shuffle=False, random_state=42) model.fit(X_train, y_train) pred = model.predict(X_test) print("mse: ", mean_squared_error(pred, y_test)) print("mae: ", mean_absolute_error(pred, y_test)) print("r2_score: ", r2_score(pred, y_test)) ''' # Perform 6-fold cross validation scores = cross_val_score(model, X_vect, y, cv=10, scoring=scorer) results = scorer.get_results() print("len: ", X_vect.shape[0]) final_scores = [] for metric_name in results.keys(): average_score = np.average(results[metric_name]) print('%s : %f' % (metric_name, average_score)) final_scores.append(average_score) r2_file.write(str(final_scores[0]) + '\n') mse_file.write(str(final_scores[1]) + '\n') mae_file.write(str(final_scores[2]) + '\n') ''' ''' df_header = ['cik_year', 'words', 'sent_words', 'roa', 'eps', 'tobinq', 'tier1_c', 'leverage', 'Z_score_c'] #10K_t+1 df1 = pd.read_csv("list_10K_next.txt", header=None, usecols=[0], names=['cik_year']) df21 = pd.read_csv("train_2006_2011_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df22 = pd.read_csv("test_2006_2011_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df23 = pd.read_csv("train_2007_2012_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df24 = pd.read_csv("test_2007_2012_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df2 = pd.concat([df21, df22, df23, df24]) df5 = df2.copy() searchfor1 = df1['cik_year'].values.tolist() df2 = df2[df2.cik_year.str.contains('|'.join(searchfor1))].reset_index() del df2['index'] #all_perf_indicators basepath1 
= "/data/ftm/xgb_regr/ch_an_data/bank_all_perf_ind_data/" dp11 = pd.read_csv(basepath1 + "train_2007_2012.csv") dp12 = pd.read_csv(basepath1 + "test_2007_2012.csv") dp1 = pd.concat([dp11, dp12]) searchfor1 = df1['cik_year'].values.tolist() dp1 = dp1[dp1.cik_year.str.contains('|'.join(searchfor1))].reset_index() del dp1['index'] dp1 = dp1.drop_duplicates() df2 = pd.merge(df2, dp1) df2 = df2.drop_duplicates() df2['prev_cik_year'] = df2['cik_year'].apply(lambda x: x.split("_")[0] + "_" + str(int(x.split("_")[1]) - 1)) #8K_t df3 = pd.read_csv("list_8K.txt", header=None, usecols=[0], names=['cik_year']) df41 = pd.read_csv("train_8K_2006_2011_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df42 = pd.read_csv("test_8K_2006_2011_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df43 = pd.read_csv("train_8K_2007_2012_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df44 = pd.read_csv("test_8K_2007_2012_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df4 = pd.concat([df41, df42, df43, df44]) searchfor1 = df3['cik_year'].values.tolist() df4 = df4[df4.cik_year.str.contains('|'.join(searchfor1))].reset_index() del df4['index'] df4 = pd.merge(df4, df2, left_on='cik_year', right_on='prev_cik_year') df4 = df4.drop_duplicates() df4 = df4.rename({'cik_year_x':'cik_year', 'mda_words_x':'words', 'mda_sent_words_x':'sent_words', 'scaled_roa_y': 'roa', 'eps_scaled': 'eps', 'tobinq_scaled': 'tobinq', 'tier1_c_scaled': 'tier1_c', 'leverage_scaled': 'leverage', 'Z_score_c_scaled': 'Z_score_c'}, axis=1) df4.to_csv("8K_t.csv", columns=df_header) #10K_t searchfor1 = df3['cik_year'].values.tolist() df5 = df5[df5.cik_year.str.contains('|'.join(searchfor1))].reset_index() del df5['index'] df5 = pd.merge(df5, df2, left_on='cik_year', right_on='prev_cik_year') df5 = df5.drop_duplicates() df5 = df5.rename({'cik_year_x':'cik_year', 'mda_words_x':'words', 'mda_sent_words_x':'sent_words', 'scaled_roa_x': 'roa', 'eps_prev_scaled': 'eps', 'tobinq_prev_scaled': 'tobinq', 'tier1_c_prev_scaled': 'tier1_c', 'leverage_prev_scaled': 'leverage', 'Z_score_c_prev_scaled': 'Z_score_c'}, axis=1) df5.to_csv("10K_t.csv", columns=df_header) df2 = df2.rename({'mda_words':'words', 'mda_sent_words':'sent_words', 'scaled_roa': 'roa', 'eps_scaled': 'eps', 'tobinq_scaled': 'tobinq', 'tier1_c_scaled': 'tier1_c', 'leverage_scaled': 'leverage', 'Z_score_c_scaled': 'Z_score_c'}, axis=1) df2.to_csv("10K_t1.csv", columns=df_header) ''' #print("after 8K: ", len(df2), len(df4), len(df5), list(df2), list(df4), list(df5)) ''' df_10K_t1 = pd.read_csv("10K_t1.csv") df_10K_t = pd.read_csv("10K_t.csv") word_type = ['words', 'sent_words'] target = ['roa', 'eps', 'tobinq', 'tier1_c', 'leverage', 'Z_score_c'] for t in target: print(t) print(df_10K_t1[t]) print(df_10K_t[t]) for w in word_type: for t in target: print("w: ", w, "t: ", t) #8K print("8K") df_8K_t = pd.read_csv("8K_t.csv") X_8K = df_8K_t[w].values.astype('U') y_8K = df_8K_t[t] X_vect_8K = tfidf_vect(X_8K) compute(X_vect_8K, y_8K) #10K_t+1 print("10K_t+1") df_10K_t1 = pd.read_csv("10K_t1.csv") X_10K1 = df_10K_t1[w].values.astype('U') y_10K1 = df_10K_t1[t] X_vect_10K1 = tfidf_vect(X_10K1) compute(X_vect_10K1, y_10K1) #10K_t print("10K_t") df_10K_t = pd.read_csv("10K_t.csv") X_10K = df_10K_t[w].values.astype('U') y_10K = df_10K_t[t] X_vect_10K = tfidf_vect(X_10K) #8K+10K (concat) print("8K+10K (concat)") X_vect_concat = csr_matrix(pd.concat([pd.DataFrame(X_vect_8K.todense()), 
pd.DataFrame(X_vect_10K1.todense())], axis=1)) compute(X_vect_concat, y_10K1) #8K+10K (sum) print("#8K+10K (sum)") X_vect_sum = pd.DataFrame(X_vect_8K.todense()).add(pd.DataFrame(X_vect_10K1.todense()), fill_value=0) compute(X_vect_sum, y_10K1) #changes print("#changes") X_vect_diff = pd.DataFrame(X_vect_10K1.todense()).subtract(pd.DataFrame(X_vect_10K.todense()), fill_value=0) compute(X_vect_diff, y_10K1) mae_file.close() mse_file.close() r2_file.close() ''' #df = pd.read_csv("10K_t.csv") #v = df[df.duplicated(['words'], keep=False)] #v = pd.concat(g for _, g in df.groupby("words"))# if len(g) > 1) #print(v) #print(df['words']) w = "words" t = "leverage" #8K print("8K") df_8K_t = pd.read_csv("8K_t.csv") X_8K = df_8K_t[w].values.astype('U') y_8K = df_8K_t[t] X_vect_8K = tfidf_vect(X_8K) compute(X_vect_8K, y_8K) print("8K", type(X_vect_8K), X_vect_8K.shape) #10K_t+1 print("10K_t+1") df_10K_t1 = pd.read_csv("10K_t1.csv") X_10K1 = df_10K_t1[w].values.astype('U') y_10K1 = df_10K_t1[t] X_vect_10K1 = tfidf_vect(X_10K1) compute(X_vect_10K1, y_10K1) print("10K_t1", type(X_vect_10K1), X_vect_10K1.shape) #10K_t print("10K_t") df_10K_t = pd.read_csv("10K_t.csv") X_10K = df_10K_t[w].values.astype('U') y_10K = df_10K_t[t] X_vect_10K = tfidf_vect(X_10K) compute(X_vect_10K, y_10K) print("10K: ", type(X_vect_10K), X_vect_10K.shape) #8K+10K (concat) print("8K+10K (concat)") X_vect_concat = csr_matrix(pd.concat([pd.DataFrame(X_vect_8K.todense()), pd.DataFrame(X_vect_10K1.todense())], axis=1)) compute(X_vect_concat, y_10K1) print("8K +10K concat: ", type(X_vect_concat), X_vect_concat.shape) #8K+10K (sum) print("#8K+10K (sum)") X_vect_sum = pd.DataFrame(X_vect_8K.todense()).add(pd.DataFrame(X_vect_10K1.todense()), fill_value=0) compute(X_vect_sum, y_10K1) print("8K + 10K sum: ", type(X_vect_sum), X_vect_sum.shape) #changes print("#changes") X_vect_diff = pd.DataFrame(X_vect_10K1.todense()).subtract(pd.DataFrame(X_vect_10K.todense()), fill_value=0) compute(X_vect_diff, y_10K1) print("changes: ", type(X_vect_diff), X_vect_diff.shape) print((X_vect_10K1.todense()==X_vect_diff.todense())) print("Total execution time: ", time.time() - start) On Thu, Mar 5, 2020 at 10:08 AM Jaime Ferrando Huertas wrote: > Can you provide the code that produces this output? > > El jue., 5 mar. 2020 a las 16:03, Vipula Rawte () > escribi?: > >> I am getting identical metric evaluation values for different data, I >> printed the matrix shape too. >> >> Below is a screenshot: >> >> [image: image.png] >> >> -- >> Regards, >> Vipula Rawte >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Regards, Vipula Rawte -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 126984 bytes Desc: not available URL: From rawtevipula25 at gmail.com Thu Mar 5 12:09:55 2020 From: rawtevipula25 at gmail.com (Vipula Rawte) Date: Thu, 5 Mar 2020 12:09:55 -0500 Subject: [scikit-learn] Getting identical mse, r2, mae for different data In-Reply-To: References: Message-ID: On Thu, Mar 5, 2020 at 10:08 AM Jaime Ferrando Huertas wrote: > Can you provide the code that produces this output? > > El jue., 5 mar. 
2020 a las 16:03, Vipula Rawte () > escribi?: > >> I am getting identical metric evaluation values for different data, I >> printed the matrix shape too. >> >> Below is a screenshot: >> >> [image: image.png] >> >> -- >> Regards, >> Vipula Rawte >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Regards, Vipula Rawte -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 126984 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: refine_8K_perf_ind_prac.py Type: text/x-python Size: 6791 bytes Desc: not available URL: From t3kcit at gmail.com Thu Mar 5 16:12:43 2020 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 5 Mar 2020 16:12:43 -0500 Subject: [scikit-learn] distances In-Reply-To: References: Message-ID: Thanks for a great summary of issues! I agree there's lots to do, though I think most of the issues that you list are quite hard and require thinking about API pretty hard. So they might not be super amendable to being solved by a shorter-term project. I was hoping there would be some more easy wins that we could get by exploiting OpenMP better (or at all) in the distances. Not sure if there is, though. I wonder if having a multicore implementation of euclidean_distances would be useful for us, or if that's going too low-level. On 3/3/20 5:47 PM, Joel Nothman wrote: > I noticed a comment by?@amueller on Gitter re?considering a project on > our distances implementations. > > I think there's a lot of work that can be done in unifying distances > implementations... (though I'm not always sure the benefit.) I thought > I would?summarise some of the issues below, as I was unsure what Andy > intended. > > As @jeremiedbb said, making n_jobs more effective would be beneficial. > Reducing duplication between metrics.pairwise and > neighbors._dist_metrics and kmeans would?be noble (especially with > regard to parameters, where scicpy.spatial's mahalanobis available > through sklearn.metrics does not accept V but sklearn.neighbors does). > and perhaps offer higher consistency of results and efficiencies. > > We also have idioms the code like "if the metric is euclidean, use > squared=True where we only need a ranking, then take the squareroot" > while neighbors metrics abstract this with an API by providing rdsit > and rdist_to_dist. > > There are issues about making sure that > pairwise_distances(metric='minkowski', p=2) is using the same > implementation as pairwise_distances(metric='euclidean'), etc. > > We have issues with chunking and distributing computations in the case > that metric params are derived from the dataset (ideally a training?set). > > #16419 is a simple instance where the metric param is sample-aligned > and needs to be chunked up. > > In other cases, we precompute some metric param over all the data, > then pass it to each chunk worker, using _precompute_metric_params > introduced in #12672. This is also relevant to #9555. > > While that initial implementation in #12672 is helpful and aims to > maintain backwards compatibility, it makes some dubious choices. 
> > Firstly in terms of code structure it is not a very modular approach - > each metric is handled with an if-then. Secondly, it *only* handles > the chunking case, relying on the fact that these metrics are in > scipy.spatial, and have a comparable handling of V=None and VI=None. > In the Gower Distances PR (#9555) when implementing a metric locally, > rather than relying on scipy.spatial, we needed to provide an > implementation of these default parameters both when the data is > chunked and when the metric function is called straight out. > > Thirdly, its approach to training vs test data is dubious. We don't > formally label X and Y in pairwise_distances as train/test, and > perhaps we should. Maintaining backwards compat with scipy's > seuclidean and mahalanobis, our implementation stacks X and Y to each > other if both are provided, and then calculates their variance. This > means that users may be applying a different metric at train and at > test time (if the variance of X as train and Y as test is > substantially different), which I consider a silent error. We can > either make the train/test nature of X and Y more explicit, or we can > require that data-based parameters are given explicitly by the user > and not implicitly computed. If I understand correctly, > sklearn.neighbors will not compute V or VI for you, and it must be > provided explicitly. (Requiring that the scaling of each feature be > given explicitly in Gower seems like an unnecessary burden on the > user, however.) > > Then there are issues like whether we should consistently set the > diagonal to zero in all metrics where Y=None. > > In short, there are several projects in distances, and I'd support > them being considered for work.... But it's a lot of engineering, if > motivated by ML needs and consistency for users. > > J > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeremie.du-boisberranger at inria.fr Fri Mar 6 05:00:45 2020 From: jeremie.du-boisberranger at inria.fr (Jeremie du Boisberranger) Date: Fri, 6 Mar 2020 11:00:45 +0100 Subject: [scikit-learn] distances In-Reply-To: References: Message-ID: <30c505cf-c178-b81c-6aa5-bf047baeaede@inria.fr> Although pairwise distances are very good candidates for OpenMP based multi-threading due to their embarrassingly parallel nature, I think euclidean distances (from the pairwise module) is the one which will less benefit from that. It's implementation, using the dot trick, uses BLAS level 3 routine (matrix matrix multiplication) which will always be better optimized, better parallelized, have runtime cpu detection. Side note: What really makes KMeans faster is not the fact that euclidean distances are computed by chunks, it's because the chunked pairwise distance matrix fits in cache so it stays there for the following operations on this matrix (finding labels, partially update centers). So it does not apply to only computing euclidean distances. On the other hand, other metrics don't all have internal multi-threading, and probably none rely on level 3 BLAS routines. Usually computing pairwise distances does not involve a lot of computations and is quite fast, so parallelizing them with joblib has no benefit due to the joblib overhead being bigger than the computations themselves. 
Unless the data is big enough but memory issues will happen before that :) Those metrics could probably benefit from OpenMP based multithreading. About going too low-level, we already have this DistanceMetric module implementing all metrics in cython, so I'd say we're already kind of low-level and in that case, using OpenMP would really just be adding a 'p' before 'range' :) I think a good first step could be to move this module in metrics, where it really belongs, rework it to make it fused typed and sparse friendly, and add some prange. Obviously it will keep most of the API flaws that @jnothman exposed but it might set up a cleaner ground for future API changes. In the end, whatever you choose, I'd be happy to help. J?r?mie (@jeremiedbb) On 05/03/2020 22:12, Andreas Mueller wrote: > Thanks for a great summary of issues! > I agree there's lots to do, though I think most of the issues that you > list are quite hard and require thinking about API pretty hard. > So they might not be super amendable to being solved by a shorter-term > project. > > I was hoping there would be some more easy wins that we could get by > exploiting OpenMP better (or at all) in the distances. > Not sure if there is, though. > > I wonder if having a multicore implementation of euclidean_distances > would be useful for us, or if that's going too low-level. > > > > On 3/3/20 5:47 PM, Joel Nothman wrote: >> I noticed a comment by?@amueller on Gitter re?considering a project >> on our distances implementations. >> >> I think there's a lot of work that can be done in unifying distances >> implementations... (though I'm not always sure the benefit.) I >> thought I would?summarise some of the issues below, as I was unsure >> what Andy intended. >> >> As @jeremiedbb said, making n_jobs more effective would be >> beneficial. Reducing duplication between metrics.pairwise and >> neighbors._dist_metrics and kmeans would?be noble (especially with >> regard to parameters, where scicpy.spatial's mahalanobis available >> through sklearn.metrics does not accept V but sklearn.neighbors >> does). and perhaps offer higher consistency of results and efficiencies. >> >> We also have idioms the code like "if the metric is euclidean, use >> squared=True where we only need a ranking, then take the squareroot" >> while neighbors metrics abstract this with an API by providing rdsit >> and rdist_to_dist. >> >> There are issues about making sure that >> pairwise_distances(metric='minkowski', p=2) is using the same >> implementation as pairwise_distances(metric='euclidean'), etc. >> >> We have issues with chunking and distributing computations in the >> case that metric params are derived from the dataset (ideally a >> training?set). >> >> #16419 is a simple instance where the metric param is sample-aligned >> and needs to be chunked up. >> >> In other cases, we precompute some metric param over all the data, >> then pass it to each chunk worker, using _precompute_metric_params >> introduced in #12672. This is also relevant to #9555. >> >> While that initial implementation in #12672 is helpful and aims to >> maintain backwards compatibility, it makes some dubious choices. >> >> Firstly in terms of code structure it is not a very modular approach >> - each metric is handled with an if-then. Secondly, it *only* handles >> the chunking case, relying on the fact that these metrics are in >> scipy.spatial, and have a comparable handling of V=None and VI=None. 
>> In the Gower Distances PR (#9555) when implementing a metric locally, >> rather than relying on scipy.spatial, we needed to provide an >> implementation of these default parameters both when the data is >> chunked and when the metric function is called straight out. >> >> Thirdly, its approach to training vs test data is dubious. We don't >> formally label X and Y in pairwise_distances as train/test, and >> perhaps we should. Maintaining backwards compat with scipy's >> seuclidean and mahalanobis, our implementation stacks X and Y to each >> other if both are provided, and then calculates their variance. This >> means that users may be applying a different metric at train and at >> test time (if the variance of X as train and Y as test is >> substantially different), which I consider a silent error. We can >> either make the train/test nature of X and Y more explicit, or we can >> require that data-based parameters are given explicitly by the user >> and not implicitly computed. If I understand correctly, >> sklearn.neighbors will not compute V or VI for you, and it must be >> provided explicitly. (Requiring that the scaling of each feature be >> given explicitly in Gower seems like an unnecessary burden on the >> user, however.) >> >> Then there are issues like whether we should consistently set the >> diagonal to zero in all metrics where Y=None. >> >> In short, there are several projects in distances, and I'd support >> them being considered for work.... But it's a lot of engineering, if >> motivated by ML needs and consistency for users. >> >> J >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From adityaselfefficient at gmail.com Wed Mar 11 01:10:10 2020 From: adityaselfefficient at gmail.com (aditya aggarwal) Date: Wed, 11 Mar 2020 10:40:10 +0530 Subject: [scikit-learn] Understanding max_features parameter in RandomForestClassifier Message-ID: For RandomForestClassifier in sklearn max_features parameter gives the max no of features for split in random forest which is sqrt(n_features) as default. If m is sqrt of n, then no of combinations for DT formation is nCm. What if nCm is less than n_estimators (no of decision trees in random forest)? *example:* For n = 7, max_features is 3, so nCm is 35, meaning 35 unique combinations of features for decision trees. Now for n_estimators = 100, will the remaining 65 trees have repeated combination of features? If so, won't trees be correlated introducing bias in the answer? Thanks Aditya Aggarwal -------------- next part -------------- An HTML attachment was scrubbed... URL: From adityaselfefficient at gmail.com Wed Mar 11 01:22:22 2020 From: adityaselfefficient at gmail.com (aditya aggarwal) Date: Wed, 11 Mar 2020 10:52:22 +0530 Subject: [scikit-learn] Threshold for roc_curve in binary classification Message-ID: Hello I was going through the logic to calculate threshold to plot roc_curve. As far as I could understand, fps, tps and threshold is calculated in slklearn.metrics._binary_clf_curve . How are multiple values of threshold calculated for binary classification? Also what is happening in the following lines? 
distinct_value_indices = np.where(np.diff(y_score))[0] threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1] Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Wed Mar 11 01:26:50 2020 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Wed, 11 Mar 2020 14:26:50 +0900 Subject: [scikit-learn] Understanding max_features parameter in RandomForestClassifier In-Reply-To: References: Message-ID: Regardless of the number of features, each DT estimator is given only a subset of the data. Each DT estimator then uses the features to derive decision rules for the samples it was given. With more trees and few examples, you might get similar or identical trees, but that is not the norm. Pardon brevity. J.B. 2020?3?11?(?) 14:11 aditya aggarwal : > For RandomForestClassifier in sklearn > > max_features parameter gives the max no of features for split in random > forest which is sqrt(n_features) as default. If m is sqrt of n, then no of > combinations for DT formation is nCm. What if nCm is less than n_estimators > (no of decision trees in random forest)? > > *example:* For n = 7, max_features is 3, so nCm is 35, meaning 35 unique > combinations of features for decision trees. Now for n_estimators = 100, > will the remaining 65 trees have repeated combination of features? If so, > won't trees be correlated introducing bias in the answer? > > > Thanks > > Aditya Aggarwal > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From adityaselfefficient at gmail.com Wed Mar 11 01:43:02 2020 From: adityaselfefficient at gmail.com (aditya aggarwal) Date: Wed, 11 Mar 2020 11:13:02 +0530 Subject: [scikit-learn] Understanding max_features parameter in RandomForestClassifier In-Reply-To: References: Message-ID: With all the parameters set to default, (especially bootstrap and max_samples), no of samples passed to each estimator is X.shape[0]. Doesn't it account for all the instances in the dataset with calculated no. of feature? Then how come only a subset is given to the estimator? On Wed, Mar 11, 2020 at 10:58 AM Brown J.B. via scikit-learn < scikit-learn at python.org> wrote: > Regardless of the number of features, each DT estimator is given only a > subset of the data. > Each DT estimator then uses the features to derive decision rules for the > samples it was given. > With more trees and few examples, you might get similar or identical > trees, but that is not the norm. > > Pardon brevity. > J.B. > > 2020?3?11?(?) 14:11 aditya aggarwal : > >> For RandomForestClassifier in sklearn >> >> max_features parameter gives the max no of features for split in random >> forest which is sqrt(n_features) as default. If m is sqrt of n, then no of >> combinations for DT formation is nCm. What if nCm is less than n_estimators >> (no of decision trees in random forest)? >> >> *example:* For n = 7, max_features is 3, so nCm is 35, meaning 35 unique >> combinations of features for decision trees. Now for n_estimators = 100, >> will the remaining 65 trees have repeated combination of features? If so, >> won't trees be correlated introducing bias in the answer? 
>> >> >> Thanks >> >> Aditya Aggarwal >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From venky.yuvy at gmail.com Wed Mar 11 04:18:27 2020 From: venky.yuvy at gmail.com (Venkatachalam N) Date: Wed, 11 Mar 2020 13:48:27 +0530 Subject: [scikit-learn] Understanding max_features parameter in RandomForestClassifier In-Reply-To: References: Message-ID: Hi Aditya, The sampling is done with replacement with the default settings. Hence, you will get different dataset even though you sample same number (`X.shape[0]`) of datapoints. Regards, Venkatachalam N. On Wed, Mar 11, 2020 at 11:14 AM aditya aggarwal < adityaselfefficient at gmail.com> wrote: > With all the parameters set to default, (especially bootstrap and > max_samples), no of samples passed to each estimator is X.shape[0]. Doesn't > it account for all the instances in the dataset with calculated no. of > feature? Then how come only a subset is given to the estimator? > > On Wed, Mar 11, 2020 at 10:58 AM Brown J.B. via scikit-learn < > scikit-learn at python.org> wrote: > >> Regardless of the number of features, each DT estimator is given only a >> subset of the data. >> Each DT estimator then uses the features to derive decision rules for the >> samples it was given. >> With more trees and few examples, you might get similar or identical >> trees, but that is not the norm. >> >> Pardon brevity. >> J.B. >> >> 2020?3?11?(?) 14:11 aditya aggarwal : >> >>> For RandomForestClassifier in sklearn >>> >>> max_features parameter gives the max no of features for split in random >>> forest which is sqrt(n_features) as default. If m is sqrt of n, then no of >>> combinations for DT formation is nCm. What if nCm is less than n_estimators >>> (no of decision trees in random forest)? >>> >>> *example:* For n = 7, max_features is 3, so nCm is 35, meaning 35 >>> unique combinations of features for decision trees. Now for n_estimators = >>> 100, will the remaining 65 trees have repeated combination of features? If >>> so, won't trees be correlated introducing bias in the answer? >>> >>> >>> Thanks >>> >>> Aditya Aggarwal >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
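A small NumPy illustration of that point (a sketch of the idea, not the forest internals verbatim): drawing X.shape[0] row indices with replacement gives every tree a different bootstrap sample, and the max_features subset is re-drawn at every split of every tree rather than fixed once per tree, so n_estimators is not limited by the number of distinct feature combinations.

import numpy as np

rng = np.random.RandomState(0)
n_samples = 1000
for tree_idx in range(3):
    idx = rng.randint(0, n_samples, n_samples)   # bootstrap draw, with replacement
    print(tree_idx, "distinct rows:", np.unique(idx).size)
# Each draw keeps only about 63% of the rows as distinct (1 - 1/e), so the
# trees see different data even though the sample size equals X.shape[0].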
URL: From gianmarcofucci94 at gmail.com Mon Mar 16 05:42:44 2020 From: gianmarcofucci94 at gmail.com (Gianmarco Fucci) Date: Mon, 16 Mar 2020 10:42:44 +0100 Subject: [scikit-learn] Study on annotation of design and implementation choices, and of technical debt Message-ID: Dear all, As software engineering research teams at the University of Sannio (Italy) and Eindhoven University of Technology (The Netherlands) we are interested in investigating the protocol used by developers while they have to annotate implementation and design choices during their normal development activities. More specifically, we are looking at whether, where and what kind of annotations developers usually use trying to be focused more on those annotations mainly aimed at highlighting that the code is not in the right shape (e.g., comments for annotating delayed or intended work activities such as TODO, FIXME, hack, workaround, etc). In the latter case, we are looking at what is the content of the above annotations, as well as how they usually behave while evolving the code that has been previously annotated. When answering the survey, in case your annotation practices are different in different open source projects you may contribute, please refer to how you behave for the projects where you have been contacted. Filling out the survey will take about 5 minutes. Please note that your identity and personal data will not be disclosed, while we plan to use the aggregated results and anonymized responses as part of a scientific publication. If you have any questions about the questionnaire or our research, please do not hesitate to contact us. You can find the survey link here: https://forms.gle/NxdVXiZQSmQ15U4T8 Thanks and regards, Gianmarco Fucci (gianmarcofucci94 at gmail.com) Fiorella Zampetti (fzampetti at unisannio.it) Alexander Serebrenik (a.serebrenik at tue.nl) Massimiliano Di Penta (dipenta at unisannio.it) -------------- next part -------------- An HTML attachment was scrubbed... URL: From nelle.varoquaux at gmail.com Tue Mar 17 11:37:11 2020 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Tue, 17 Mar 2020 16:37:11 +0100 Subject: [scikit-learn] Announcing the 2020 John Hunter Excellence in Plotting Contest Message-ID: Dear all, I apologize for the cross-posting. In memory of John Hunter, we are pleased to announce the John Hunter Excellence in Plotting Contest for 2020. This open competition aims to highlight the importance of data visualization to scientific progress and showcase the capabilities of open source software. Participants are invited to submit scientific plots to be judged by a panel. The winning entries will be announced and displayed at SciPy 2020 or announced in the John Hunter Excellence in Plotting Contest website and youtube channel. John Hunter?s family are graciously sponsoring cash prizes for the winners in the following amounts: - 1st prize: $1000 - 2nd prize: $750 - 3rd prize: $500 - Entries must be submitted by June 1st to the form at https://forms.gle/SrexmkDwiAmDc7ej7 - Winners will be announced at Scipy 2020 in Austin, TX or publicly on the John Hunter Excellence in Plotting Contest website and youtube channel - Participants do not need to attend the Scipy conference. - Entries may take the definition of ?visualization? rather broadly. Entries may be, for example, a traditional printed plot, an interactive visualization for the web, a dashboard, or an animation. 
- Source code for the plot must be provided, in the form of Python code and/or a Jupyter notebook, along with a rendering of the plot in a widely used format. The rendering may be, for example, PDF for print, standalone HTML and Javascript for an interactive plot, or MPEG-4 for a video. If the original data can not be shared for reasons of size or licensing, "fake" data may be substituted, along with an image of the plot using real data. - Each entry must include a 300-500 word abstract describing the plot and its importance for a general scientific audience. - Entries will be judged on their clarity, innovation and aesthetics, but most importantly for their effectiveness in communicating a real-world problem. Entrants are encouraged to submit plots that were used during the course of research or work, rather than merely being hypothetical. - SciPy and the John Hunter Excellence in Plotting Contest organizers reserves the right to display any and all entries, whether prize-winning or not, at the conference, use in any materials or on its website, with attribution to the original author(s). - Past entries can be found at https://jhepc.github.io/ - Questions regarding the contest can be sent to jhepc.organizers at gmail.com John Hunter Excellence in Plotting Contest Co-Chairs Madicken Munk Nelle Varoquaux -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbc.develop at gmail.com Wed Mar 18 17:42:18 2020 From: jbc.develop at gmail.com (Juan BC) Date: Wed, 18 Mar 2020 18:42:18 -0300 Subject: [scikit-learn] The Coronavirus Tech Handbook Message-ID: Sorry for the offtopic https://coronavirustechhandbook.com/ <<<< The Coronavirus Tech Handbook provides a space for technologists, specialists, civic organisations and public & private institutions to collaborate on a rapid and sophisticated response to the coronavirus outbreak. It is a dynamic resource with many hundreds of contributors that is evolving very quickly. -- Juan B Cabral -------------- next part -------------- An HTML attachment was scrubbed... URL: From gk68118 at gmail.com Thu Mar 19 02:11:49 2020 From: gk68118 at gmail.com (Praneet Singh) Date: Thu, 19 Mar 2020 11:41:49 +0530 Subject: [scikit-learn] transfer learning doubt Message-ID: I am training a SGD Classifier with some training dataset which is temporary and will be lost after sometime. So I am planning to save the model in pickle file and reuse it and train again with some another dataset that arrives. But It forgets the previously learned data. As far as I researched in google, tensorflow model allows transfer learning and not forgetting the previous learning but is there any other way with sklearn model to achieve this?? any help would be appreciated -------------- next part -------------- An HTML attachment was scrubbed... URL: From fad469 at uregina.ca Thu Mar 19 09:19:38 2020 From: fad469 at uregina.ca (Farzana Anowar) Date: Thu, 19 Mar 2020 07:19:38 -0600 Subject: [scikit-learn] transfer learning doubt In-Reply-To: References: Message-ID: On 2020-03-19 00:11, Praneet Singh wrote: > I am training a SGD Classifier with some training dataset which is > temporary and will be lost after sometime. So I am planning to save > the model in pickle file and reuse it and train again with some > another dataset that arrives. But It forgets the previously learned > data. 
> > As far as I researched in google, tensorflow model allows transfer > learning and not forgetting the previous learning but is there any > other way with sklearn model to achieve this?? > any help would be appreciated > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn Did you use incremental estimator and partial _fit? If not, try to use them. Should work. Another option is to us deep learning and store the weights for the first model and initialize the second model with that weight and keep doing it for the rest of the models. -- Best Regards, Farzana Anowar, PhD Candidate Department of Computer Science University of Regina From rth.yurchak at gmail.com Thu Mar 19 10:06:37 2020 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Thu, 19 Mar 2020 15:06:37 +0100 Subject: [scikit-learn] transfer learning doubt In-Reply-To: References: Message-ID: <6ac57655-1ac5-34ea-416b-6c65d641ba7b@gmail.com> On 19/03/2020 14:19, Farzana Anowar wrote: > Another option is to us deep learning and store the weights for the first model and initialize the second model with that weight and keep doing it for the rest of the models. This can also be done in scikit-learn with models that support warm_start=True init parameter (including SGDClassifier). Roman From krallinger.martin at gmail.com Thu Mar 19 12:21:36 2020 From: krallinger.martin at gmail.com (Martin Krallinger) Date: Thu, 19 Mar 2020 17:21:36 +0100 Subject: [scikit-learn] Final CFP CodiEsp: Clinical Case Coding Task (eHealth CLEF 2020) In-Reply-To: References: Message-ID: *** Call for Participation CodiEsp: Clinical Case Coding Task (eHealth CLEF 2020) *** * *CodiEsp (eHealth CLEF? Multilingual Information Extraction) Shared Task on automatic assignment of ICD10 codes (procedures, diagnosis) track at CLEF 2020* http://temu.bsc.es/codiesp Plan TL Award for the CodiEsp Track The CodiEsp sub-tracks: *1.CodiEsp Diagnosis Coding *sub-task* (CodiEsp-D)*: will require automatic ICD10-CM [CIE10 Diagn?stico] code assignment to each clinical case document. *2.CodiEsp Procedure Coding *sub-task* (CodiEsp-P):* will require automatic ICD10-PCS [CIE10 Procedimiento] code assignment to each clinical case document. *3.CodiEsp Explainable AI *exploratory sub-task* (CodiEsp-X).* Systems are required to extract the evidence text supporting the predicted codes (both ICD10-CM and ICD10-PCS). *Task description* Clinical coding essentially requires the transformation (or classification) of medical texts into a structured or coded format using internationally recognized class codes. These codes describe a patient?s diagnosis or treatment. Clinical coding is critical for standardizing electronic clinical records; enable aetiology studies, monitor health trends, carry out epidemiology studies, clinical and biomedical research, assist clinical decision-making or even reimbursement. As part of the eHealth CLEF (http://clef-ehealth.org) Multilingual Information Extraction Shared Task we organize* CodiEsp: Clinical Case Coding Task (http://temu.bsc.es/codiesp ). *The CodiEsp task will address the automatic extraction and assignment of clinical coding (diagnosis and procedures) to clinical case documents in Spanish. To enable participation of researches around the world, in addition to the basic data in Spanish, we will also publish versions of the training, development, and test set *automatically translated into English*. 
Participating systems will be asked to automatically assign ICD10 codes (or CIE-10, in Spanish) to clinical case documents. Evaluation is done through comparison to manually assigned ICD10 codes. *Publications and workshop* As in previous eHealth CLEF efforts, there will be an *evaluation workshop allocated at CLEF 2020* where participating teams can present their systems and results. Moreover, participating teams will be invited to submit their system description papers for publication at the *CLEF 2020 Working Notes proceedings*. For previous working notes see: http://ceur-ws.org/Vol-2125/ *CodiEsp awards* There will be three awards for the top-scoring teams promoted by the Spanish Plan for the Advancement of Language Technology (Plan TL) and the Barcelona Supercomputing Center (BSC). -------------------------------------- *Participation and useful info* -------------------------------------- 1. CodiEsp web, info & detailed description: http://temu.bsc.es/codiesp/ 2. Registration for CodiEsp (Multilingual Information Extraction eHealth track): http://temu.bsc.es/codiesp/index.php/registration/ 3. Datasets: https://zenodo.org/record/3693570 4. Additional training resources: https://doi.org/10.5281/zenodo.3606662 ------------------------ *Main CodiEsp Track organizers* ------------------------ - *Martin Krallinger*, Barcelona Supercomputing Center. - *Antonio Miranda*, Barcelona Supercomputing Center. - *Aitor Gonzalez-Agirre*, Barcelona Supercomputing Center. - *Marta Villegas*, Barcelona Supercomputing Center. - *Jordi Armengol*, Barcelona Supercomputing Center. ------------------------ *Important Dates* ------------------------ Jan 13: Training and development set release March 2: Test and background set release May 3: End of evaluation May 5: Results notified May 24: Paper submission Jun 28: Camera-ready paper submission Sep 22-25: CLEF 2020 Conference (Thessaloniki, Greece) -------------- next part -------------- An HTML attachment was scrubbed... URL: From MC_George123 at hotmail.com Wed Mar 25 22:16:03 2020 From: MC_George123 at hotmail.com (MC_George123 at hotmail.com) Date: Thu, 26 Mar 2020 02:16:03 +0000 Subject: [scikit-learn] A basic question about kmeans algorithms elkan and lloyd Message-ID: Hi admins, My team is working on optimization of scikit-learn stuff now. When it comes to kmeans, I find there are two algorithms, one of which is lloyd and the other is elkan, which is an optimized version of lloyd using the triangle inequality. In older versions of scikit-learn, elkan only supports dense datasets, not sparse ones. And in the latest version, elkan supports both types of datasets. So the question is why both algorithms are kept in kmeans, since they do almost the same thing and elkan is an optimized version of lloyd. Are there any precision differences between the two algorithms, and how can I decide which algorithm to use? Best regards, George Fan -------------- next part -------------- An HTML attachment was scrubbed...
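A quick way to look at the precision part of that question empirically (a hedged sketch on synthetic blobs; note that in the 0.22-era API the Lloyd variant is selected with algorithm='full'): both settings minimise the same objective, so with the same initialisation the labels and inertia_ should agree up to floating-point round-off, and the practical difference is speed and memory rather than accuracy.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20000, n_features=10, centers=50, random_state=0)

# "full" is the plain Lloyd iteration; "elkan" adds triangle-inequality bounds.
for algo in ("full", "elkan"):
    km = KMeans(n_clusters=50, algorithm=algo, n_init=1, random_state=0).fit(X)
    print(algo, km.n_iter_, km.inertia_)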
URL: From alexandre.gramfort at inria.fr Thu Mar 26 03:40:15 2020 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Thu, 26 Mar 2020 08:40:15 +0100 Subject: [scikit-learn] A basic question about kmeans algorithms elkan and llyod In-Reply-To: References: Message-ID: hi, I suspect Elkan is really winning when you have many centroids so the conclusion is not systematic my 2c Alex On Thu, Mar 26, 2020 at 3:18 AM MC_George123 at hotmail.com < MC_George123 at hotmail.com> wrote: > Hi admins, > > > > My team is working on optimization on scikit-learn staff now. When it > comes to kmeans, I find there are two algorithms, one of which is lloyd and > the other is elkan, which is the optimized one for lloyd using triangle > inequality. In the older version of scikit-learn, elkan only supports > dense dataset instead of sparse one. And in the latest version, elkan > supports both type of datasets. So there is a question why both two > algorithms are kept in kmeans since they do the almost same thing and elkan > is a optimized one for lloyd. Are there any precision difference between > two algorithms and how can I decide what algorithm to use? > > > > Best regards, > > George Fan > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From niourf at gmail.com Thu Mar 26 15:59:25 2020 From: niourf at gmail.com (Nicolas Hug) Date: Thu, 26 Mar 2020 15:59:25 -0400 Subject: [scikit-learn] Monthly meetings Message-ID: <080445a5-1230-26c2-b582-03c760d1f80e@gmail.com> Hi all, The next scikit-learn monthly meeting will take place on Monday (https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=3&day=30&hour=11&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195 ) While these meetings are mainly for core-devs to discuss the current topics, we're also happy to welcome non-core devs and other projects maintainers! Feel free to join. *Location:* Join Zoom Meeting https://anaconda.zoom.us/j/947129165?pwd=dEFZNHM0ZFBiQWlDYlJlRW1EaHg2QT09 Meeting ID: 947 129 165 Password: 586745 Thanks, Nicolas -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Mar 27 12:32:39 2020 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 27 Mar 2020 12:32:39 -0400 Subject: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team Message-ID: <60bf6211-18f9-7408-03da-a5157c754145@gmail.com> Hey all. There's a pretty cool paper by a team at MS that analyses public github repos for their use of the sklearn and related libraries: https://arxiv.org/abs/1912.09536 Thought it might be of interest. Cheers, Andy From t3kcit at gmail.com Fri Mar 27 12:36:52 2020 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 27 Mar 2020 12:36:52 -0400 Subject: [scikit-learn] A basic question about kmeans algorithms elkan and llyod In-Reply-To: References: Message-ID: There's an interesting analysis in this paper: Fast K-Means with Accurate Bounds http://proceedings.mlr.press/v48/newling16.pdf On 3/26/20 3:40 AM, Alexandre Gramfort wrote: > hi, > > I suspect Elkan is really winning when you have many centroids > so the conclusion is not systematic > > my 2c > Alex > > > On Thu, Mar 26, 2020 at 3:18 AM MC_George123 at hotmail.com > > wrote: > > Hi admins, > > My team is working on optimization on scikit-learn staff now. 
> When it comes to kmeans, I find there are two algorithms, one of which
> is lloyd and the other is elkan, which is the optimized one for
> lloyd using triangle inequality. In the older version of
> scikit-learn, elkan only supports dense dataset instead of sparse
> one. And in the latest version, elkan supports both type of
> datasets. So there is a question why both two algorithms are kept
> in kmeans since they do the almost same thing and elkan is a
> optimized one for lloyd. Are there any precision difference
> between two algorithms and how can I decide what algorithm to use?
>
> Best regards,
>
> George Fan
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From rth.yurchak at gmail.com  Fri Mar 27 13:10:28 2020
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Fri, 27 Mar 2020 18:10:28 +0100
Subject: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team
In-Reply-To: <60bf6211-18f9-7408-03da-a5157c754145@gmail.com>
References: <60bf6211-18f9-7408-03da-a5157c754145@gmail.com>
Message-ID: 

Very interesting! A few comments,

> From GH17, we managed to extract only 10.5k pipelines. The relatively low
> frequency (with respect to the number of notebooks using SCIKIT-LEARN [..])
> indicates a non-wide adoption of this specification. However, the number of
> pipelines in the GH19 corpus is 132k pipelines (i.e., an increase of 13× [..]
> since 2017).

It's nice to see that pipelines are indeed widely used.

> Top-5 transformers [from imports] in GH19 are StandardScaler, CountVectorizer,
> TfidfTransformer, PolynomialFeatures, TfidfVectorizer (in this order). Same are
> the results for GH17 with the difference that PCA is instead of TfidfVectorizer.

Hmm, I would have expected OneHotEncoder somewhere at the top and much less text processing. If there is real usage of CountVectorizer and TfidfTransformer separately, then maybe deprecating TfidfVectorizer could be done https://github.com/scikit-learn/scikit-learn/issues/14951

Though this ranking looks quite unexpected. I wonder if they have the full list and not just the top5.

> Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression,
> MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this order).

Maybe LinearRegression docstring should more strongly suggest to use Ridge with small regularization in practice.

-- 
Roman

On 27/03/2020 17:32, Andreas Mueller wrote:
> Hey all.
> There's a pretty cool paper by a team at MS that analyses public github
> repos for their use of the sklearn and related libraries:
> https://arxiv.org/abs/1912.09536
>
> Thought it might be of interest.
>
> Cheers,
> Andy
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From gael.varoquaux at normalesup.org  Fri Mar 27 18:20:17 2020
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 27 Mar 2020 23:20:17 +0100
Subject: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team
In-Reply-To: 
References: <60bf6211-18f9-7408-03da-a5157c754145@gmail.com>
Message-ID: <20200327222017.fv7jgxrbulntgmbm@phare.normalesup.org>

Thanks for the link Andy. This is indeed very interesting!

On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
> > Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression,
> > MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this
> > order).

> Maybe LinearRegression docstring should more strongly suggest to use Ridge
> with small regularization in practice.

Yes! I actually wonder if we should not remove LinearRegression. It frightens me a bit that so many people use it. The only time that I've seen it used in a scientific paper, it was a mistake and it shouldn't have been used.

I seldom advocate for deprecating :).

G

From pedro.cardoso.code at gmail.com  Sun Mar 29 13:21:21 2020
From: pedro.cardoso.code at gmail.com (Pedro Cardoso)
Date: Sun, 29 Mar 2020 18:21:21 +0100
Subject: [scikit-learn] [GridSearchCV] Reduction of elapsed time at the second iteration
Message-ID: 

Hello fellows, I am new to sklearn and I have a question about GridSearchCV.

I am running the following code in a Jupyter notebook:

----------------------*code*-------------------------------
opt_models = dict()
for feature in [features1, features2, features3, features4]:
    cmb = CMB(x_train, y_train, x_test, y_test, feature)
    cmb.fit()
    cmb.predict()
    opt_models[str(feature)] = cmb.get_best_model()
-------------------------------------------------------

The CMB class is just a class that contains different classification models (SVC, decision tree, etc.). When cmb.fit() is running, a GridSearchCV is performed on the SVC model (which is within the cmb instance) in order to tune the hyperparameters C, gamma, and kernel. The SVC model is implemented using the sklearn.svm.SVC class.

Here is the output of the first and second iteration of the for loop:

---------------------*output*-------------------------------------
-> 1st iteration
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 6.1s
[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 6.1s
[Parallel(n_jobs=-1)]: Done 3 tasks | elapsed: 6.1s
[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 6 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 7 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 11 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 12 tasks | elapsed: 6.3s
[Parallel(n_jobs=-1)]: Done 13 tasks | elapsed: 6.3s
[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 6.3s
[Parallel(n_jobs=-1)]: Done 15 tasks | elapsed: 6.4s
[Parallel(n_jobs=-1)]: Done 16 tasks | elapsed: 6.4s
[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 6.4s
[Parallel(n_jobs=-1)]: Done 18 tasks | elapsed: 6.4s
[Parallel(n_jobs=-1)]: Done 19 tasks | elapsed: 6.5s
[Parallel(n_jobs=-1)]: Done 20 tasks | elapsed: 6.5s
[Parallel(n_jobs=-1)]: Done 21 tasks | elapsed: 6.5s
[Parallel(n_jobs=-1)]: Done 22 tasks | elapsed: 6.6s
[Parallel(n_jobs=-1)]: Done 23 tasks | elapsed: 6.7s
[Parallel(n_jobs=-1)]: Done 24 tasks | elapsed: 6.7s
[Parallel(n_jobs=-1)]: Done 25 tasks | elapsed: 6.7s
[Parallel(n_jobs=-1)]: Done 26 tasks | elapsed: 6.8s
[Parallel(n_jobs=-1)]: Done 27 tasks | elapsed: 6.8s
[Parallel(n_jobs=-1)]: Done 28 tasks | elapsed: 6.9s
[Parallel(n_jobs=-1)]: Done 29 tasks | elapsed: 6.9s
[Parallel(n_jobs=-1)]: Done 30 tasks | elapsed: 6.9s
[Parallel(n_jobs=-1)]: Done 31 tasks | elapsed: 7.0s
[Parallel(n_jobs=-1)]: Done 32 tasks | elapsed: 7.0s
[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 7.0s
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 7.0s
[Parallel(n_jobs=-1)]: Done 35 tasks | elapsed: 7.1s
[Parallel(n_jobs=-1)]: Done 36 tasks | elapsed: 7.1s
[Parallel(n_jobs=-1)]: Done 37 tasks | elapsed: 7.2s
[Parallel(n_jobs=-1)]: Done 38 tasks | elapsed: 7.2s
[Parallel(n_jobs=-1)]: Done 39 tasks | elapsed: 7.2s
[Parallel(n_jobs=-1)]: Done 40 tasks | elapsed: 7.2s
[Parallel(n_jobs=-1)]: Done 41 tasks | elapsed: 7.3s
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 7.3s
[Parallel(n_jobs=-1)]: Done 43 tasks | elapsed: 7.3s
[Parallel(n_jobs=-1)]: Done 44 tasks | elapsed: 7.4s
[Parallel(n_jobs=-1)]: Done 45 tasks | elapsed: 7.4s
[Parallel(n_jobs=-1)]: Done 46 tasks | elapsed: 7.5s

-> 2nd iteration
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0260s.) Setting batch_size=14.
[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Done 3 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Done 60 out of 60 | elapsed: 0.7s finished
---------------------------------------------------------------------------------------------------------------------

As you can see, the first iteration has a much larger elapsed time than the 2nd iteration. Does that make sense? I am afraid that the model is doing some kind of caching or taking a shortcut from the 1st iteration, which could consequently decrease the model training/performance. I already read the sklearn documentation and I didn't see any warning/note about this kind of behaviour.
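Independently of GridSearchCV itself, part of that pattern is easy to reproduce with joblib alone: with the loky backend, the worker pool is started on first use and then reused by later calls in the same process. A minimal sketch (the toy function and sizes below are made up for illustration):

import time
from joblib import Parallel, delayed

def slow_square(x):
    time.sleep(0.1)  # pretend this is a model fit
    return x * x

for it in range(2):
    tic = time.time()
    Parallel(n_jobs=-1)(delayed(slow_square)(i) for i in range(20))
    print("call", it + 1, "took %.2fs" % (time.time() - tic))

The first call pays the cost of spawning the worker processes; the second call reuses the already-running workers, so it finishes faster even though the work submitted is identical.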
Thank you very much for your time :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From MC_George123 at hotmail.com  Mon Mar 30 03:33:08 2020
From: MC_George123 at hotmail.com (=?utf-8?B?5qiKIOS5puWNjg==?=)
Date: Mon, 30 Mar 2020 07:33:08 +0000
Subject: [scikit-learn] A basic question about kmeans algorithms elkan and llyod
In-Reply-To: 
References: 
Message-ID: 

Hi,

Thanks for your suggestion of the paper. However, the paper shows many more algorithms and finds that different algorithms show different performance on datasets with various dimensions, the Lloyd algorithm not included. What I want to know is whether we can remove the Lloyd algorithm from kmeans in scikit-learn, since elkan is an optimized one with better performance.

Best regards,
George

From: scikit-learn On Behalf Of Andreas Mueller
Sent: Saturday, March 28, 2020 12:37 AM
To: scikit-learn at python.org
Subject: Re: [scikit-learn] A basic question about kmeans algorithms elkan and llyod

There's an interesting analysis in this paper:
Fast K-Means with Accurate Bounds
http://proceedings.mlr.press/v48/newling16.pdf

On 3/26/20 3:40 AM, Alexandre Gramfort wrote:

hi,

I suspect Elkan is really winning when you have many centroids
so the conclusion is not systematic

my 2c
Alex

On Thu, Mar 26, 2020 at 3:18 AM MC_George123 at hotmail.com <MC_George123 at hotmail.com> wrote:

Hi admins,

My team is working on optimization on scikit-learn staff now. When it comes to kmeans, I find there are two algorithms, one of which is lloyd and the other is elkan, which is the optimized one for lloyd using triangle inequality. In the older version of scikit-learn, elkan only supports dense dataset instead of sparse one. And in the latest version, elkan supports both type of datasets. So there is a question why both two algorithms are kept in kmeans since they do the almost same thing and elkan is a optimized one for lloyd. Are there any precision difference between two algorithms and how can I decide what algorithm to use?

Best regards,

George Fan

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From adrin.jalali at gmail.com  Mon Mar 30 07:02:14 2020
From: adrin.jalali at gmail.com (Adrin)
Date: Mon, 30 Mar 2020 13:02:14 +0200
Subject: [scikit-learn] Monthly meetings
In-Reply-To: <080445a5-1230-26c2-b582-03c760d1f80e@gmail.com>
References: <080445a5-1230-26c2-b582-03c760d1f80e@gmail.com>
Message-ID: 

Hi,

The new meeting ID:
https://anaconda.zoom.us/j/324780759?pwd=a1ROSFE2Nnc0cHBaeUtiVS93QnpHQT09

Meeting ID: 324 780 759
Password: 617892

On Thu, Mar 26, 2020 at 9:00 PM Nicolas Hug wrote:

> Hi all,
>
> The next scikit-learn monthly meeting will take place on Monday (
> https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=3&day=30&hour=11&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195
> )
>
> While these meetings are mainly for core-devs to discuss the current
> topics, we're also happy to welcome non-core devs and other projects
> maintainers! Feel free to join.
>
> *Location:*
>
> Join Zoom Meeting
>
> https://anaconda.zoom.us/j/947129165?pwd=dEFZNHM0ZFBiQWlDYlJlRW1EaHg2QT09
>
> Meeting ID: 947 129 165
> Password: 586745
>
>
> Thanks,
> Nicolas
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From olivier.grisel at ensta.org  Mon Mar 30 07:03:44 2020
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Mon, 30 Mar 2020 13:03:44 +0200
Subject: [scikit-learn] Monthly meetings
In-Reply-To: <080445a5-1230-26c2-b582-03c760d1f80e@gmail.com>
References: <080445a5-1230-26c2-b582-03c760d1f80e@gmail.com>
Message-ID: 

I get a message for an invalid meeting id.

-- 
Olivier

From t3kcit at gmail.com  Mon Mar 30 10:30:09 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 30 Mar 2020 10:30:09 -0400
Subject: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team
In-Reply-To: <20200327222017.fv7jgxrbulntgmbm@phare.normalesup.org>
References: <60bf6211-18f9-7408-03da-a5157c754145@gmail.com> <20200327222017.fv7jgxrbulntgmbm@phare.normalesup.org>
Message-ID: <272061a7-0eda-dd2c-8666-a4be22a40e92@gmail.com>

On 3/27/20 6:20 PM, Gael Varoquaux wrote:
> Thanks for the link Andy. This is indeed very interesting!
>
> On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
>>> Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression,
>>> MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this
>>> order).
>> Maybe LinearRegression docstring should more strongly suggest to use Ridge
>> with small regularization in practice.
> Yes! I actually wonder if we should not remove LinearRegression. It's a
> bit frightening me that so many people use it. The only time that I've
> seen it used in a scientific people, it was a mistake and it shouldn't
> have been used.
>
> I seldom advocate for deprecating :).
>

People use sklearn for inference. I'm not sure we should deprecate this use case even though it's not our primary motivation.

Also, there's an inconsistency here: Logistic Regression has an L2 penalty by default (to the annoyance of some), while Linear Regression does not. We have discussed the meaning of the different classes for linear models several times; they are certainly not consistent (ridge, lasso and lr are three classes for squared loss, while all three are in LogisticRegression for the log loss).

I think to many, "use statsmodels" is not a satisfying answer.

I have seen people argue that linear regression or logistic regression should throw an error on collinear data, and I think that's not in the spirit of sklearn (even though we had this as a warning in discriminant analysis until recently). But we should probably have clearer signaling about this.

Our documentation doesn't really emphasize the prediction vs inference point enough, I think.

Btw, we could also make our linear regression more stable by using the minimum norm solution via the SVD.
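For that last point, a minimal sketch of what a "minimum norm solution via the SVD" looks like in practice, using plain numpy rather than any scikit-learn API (the small random design matrix is made up for illustration):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(20, 5)
X[:, 4] = X[:, 2] + X[:, 3]   # make the design rank-deficient (collinear columns)
y = rng.randn(20)

# np.linalg.lstsq solves the least-squares problem through an SVD and, for
# rank-deficient X, returns the solution with the smallest norm, so it stays
# well defined even though X.T @ X is singular here.
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("rank of X:", rank)
print("coefficients:", coef)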
From t3kcit at gmail.com  Mon Mar 30 10:35:43 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 30 Mar 2020 10:35:43 -0400
Subject: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team
In-Reply-To: <272061a7-0eda-dd2c-8666-a4be22a40e92@gmail.com>
References: <60bf6211-18f9-7408-03da-a5157c754145@gmail.com> <20200327222017.fv7jgxrbulntgmbm@phare.normalesup.org> <272061a7-0eda-dd2c-8666-a4be22a40e92@gmail.com>
Message-ID: <71befd21-75b6-6370-2416-a5b01225c492@gmail.com>

Also see https://github.com/scikit-learn/scikit-learn/issues/14268 which is discussing how to make things faster *and* more stable!

On 3/30/20 10:30 AM, Andreas Mueller wrote:
>
> On 3/27/20 6:20 PM, Gael Varoquaux wrote:
>> Thanks for the link Andy. This is indeed very interesting!
>>
>> On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
>>>> Regarding learners, Top-5 in both GH17 and GH19 are
>>>> LogisticRegression,
>>>> MultinomialNB, SVC, LinearRegression, and RandomForestClassifier
>>>> (in this
>>>> order).
>>> Maybe LinearRegression docstring should more strongly suggest to use
>>> Ridge
>>> with small regularization in practice.
>> Yes! I actually wonder if we should not remove LinearRegression. It's a
>> bit frightening me that so many people use it. The only time that I've
>> seen it used in a scientific people, it was a mistake and it shouldn't
>> have been used.
>>
>> I seldom advocate for deprecating :).
>>
> People use sklearn for inference. I'm not sure we should deprecate
> this usecase even though it's not
> our primary motivation.
>
> Also, there's an inconsistency here: Logistic Regression has an L2
> penalty by default (to the annoyance of some),
> while Linear Regression does not. We have discussed the meaning of the
> different classes for linear models several times,
> they are certainly not consistent (ridge, lasso and lr are three
> classes for squared loss while all three are in LogisticRegression for
> the log loss).
>
> I think to many "use statsmodels" is not a satisfying answer.
>
> I have seen people argue that linear regression or logistic regression
> should throw an error on colinear data, and I think that's not in the
> spirit of sklearn
> (even though we had this as a warning in discriminant analysis until
> recently).
> But we should probably have more clear signaling about this.
>
> Our documentation doesn't really emphasize the prediction vs inference
> point enough, I think.
>
> Btw, we could also make our linear regression more stable by using the
> minimum norm solution via the SVD.

From t3kcit at gmail.com  Mon Mar 30 15:03:58 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 30 Mar 2020 15:03:58 -0400
Subject: [scikit-learn] A basic question about kmeans algorithms elkan and llyod
In-Reply-To: 
References: 
Message-ID: <1982d76e-554e-0770-3eb1-e970f3a9e983@gmail.com>

Sorry, I thought it also did experiments on what they call "sta" but I guess they are not included. The conclusion is the same, though. Different algorithms show different performance on different datasets.

The Yinyang k-means paper has some elkan vs lloyd figures:
http://proceedings.mlr.press/v37/ding15.pdf
In table 2, the Elkan row, in cases where the speedup is <1, it means elkan is slower than lloyd. Elkan is also more memory intensive, so you can see some missing values where the computation couldn't be performed, but lloyd could.

On 3/30/20 3:33 AM, 樊 书华 wrote:
> Hi,
> Thanks for your suggestion of the paper.
> However, the paper shows many more algorithms and finds out different
> algorithms show different performance on dataset with various
> dimensions, Lloyd algorithm not included. What I want to know is that
> can we remove the Lloyd algorithm in kmeans of scikit-learn since
> elkan is an optimized on with better performance.
>
> Best regards,
>
> George
>
> *From:* scikit-learn *On Behalf Of *Andreas Mueller
> *Sent:* Saturday, March 28, 2020 12:37 AM
> *To:* scikit-learn at python.org
> *Subject:* Re: [scikit-learn] A basic question about kmeans algorithms
> elkan and llyod
>
> There's an interesting analysis in this paper:
> Fast K-Means with Accurate Bounds
>
> http://proceedings.mlr.press/v48/newling16.pdf
>
> On 3/26/20 3:40 AM, Alexandre Gramfort wrote:
>
> hi,
>
> I suspect Elkan is really winning when you have many centroids
>
> so the conclusion is not systematic
>
> my 2c
>
> Alex
>
> On Thu, Mar 26, 2020 at 3:18 AM MC_George123 at hotmail.com
> <MC_George123 at hotmail.com> wrote:
>
> Hi admins,
>
> My team is working on optimization on scikit-learn staff now.
> When it comes to kmeans, I find there are two algorithms, one
> of which is lloyd and the other is elkan, which is the
> optimized one for lloyd using triangle inequality. In the
> older version of scikit-learn, elkan only supports dense
> dataset instead of sparse one. And in the latest version,
> elkan supports both type of datasets. So there is a question
> why both two algorithms are kept in kmeans since they do the
> almost same thing and elkan is a optimized one for lloyd. Are
> there any precision difference between two algorithms and how
> can I decide what algorithm to use?
>
> Best regards,
>
> George Fan
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From MC_George123 at hotmail.com  Tue Mar 31 03:49:45 2020
From: MC_George123 at hotmail.com (=?utf-8?B?5qiKIOS5puWNjg==?=)
Date: Tue, 31 Mar 2020 07:49:45 +0000
Subject: [scikit-learn] A basic question about kmeans algorithms elkan and llyod
In-Reply-To: <1982d76e-554e-0770-3eb1-e970f3a9e983@gmail.com>
References: <1982d76e-554e-0770-3eb1-e970f3a9e983@gmail.com>
Message-ID: 

Thank you very much for your information.

From: scikit-learn On Behalf Of Andreas Mueller
Sent: Tuesday, March 31, 2020 3:04 AM
To: scikit-learn at python.org
Subject: Re: [scikit-learn] A basic question about kmeans algorithms elkan and llyod

sorry I thought it also did experiements on what they call "sta" but I guess they are not included. The conclusion is the same, though. Different algorithms show different performance on different datasets.

The Yingyang k-means has some elkan vs lloyd figures:
http://proceedings.mlr.press/v37/ding15.pdf
In table 2, the Elkan row, in cases the speedup is <1, it means elkans is slower than lloyd. Elkans is also more memory intensive, so you can see some missing values in that where the computation couldn't be performed, but lloyd could.

On 3/30/20 3:33 AM, 樊 书华 wrote:
Hi,
Thanks for your suggestion of the paper.
However, the paper shows many more algorithms and finds out different algorithms show different performance on dataset with various dimensions, Lloyd algorithm not included. What I want to know is that can we remove the Lloyd algorithm in kmeans of scikit-learn since elkan is an optimized on with better performance. Best regards, George From: scikit-learn On Behalf Of Andreas Mueller Sent: Saturday, March 28, 2020 12:37 AM To: scikit-learn at python.org Subject: Re: [scikit-learn] A basic question about kmeans algorithms elkan and llyod There's an interesting analysis in this paper: Fast K-Means with Accurate Bounds http://proceedings.mlr.press/v48/newling16.pdf On 3/26/20 3:40 AM, Alexandre Gramfort wrote: hi, I suspect Elkan is really winning when you have many centroids so the conclusion is not systematic my 2c Alex On Thu, Mar 26, 2020 at 3:18 AM MC_George123 at hotmail.com > wrote: Hi admins, My team is working on optimization on scikit-learn staff now. When it comes to kmeans, I find there are two algorithms, one of which is lloyd and the other is elkan, which is the optimized one for lloyd using triangle inequality. In the older version of scikit-learn, elkan only supports dense dataset instead of sparse one. And in the latest version, elkan supports both type of datasets. So there is a question why both two algorithms are kept in kmeans since they do the almost same thing and elkan is a optimized one for lloyd. Are there any precision difference between two algorithms and how can I decide what algorithm to use? Best regards, George Fan _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From benoit.presles at u-bourgogne.fr Tue Mar 31 09:48:50 2020 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Tue, 31 Mar 2020 15:48:50 +0200 Subject: [scikit-learn] Number of informative features vs total number of features Message-ID: <10c2473f-50e3-c959-b9f7-07c2b903c840@u-bourgogne.fr> Dear sklearn users, I did some supervised classification simulations with the make_classification function from sklearn increasing the number of informative features from 1 out of 40 to 40 out of 40 (100%). I did not generate any repeated or redundant features. I fixed the number of classes to two and the number of clusters per class to one. I split the dataset 100 times using the StratifiedShuffleSplit function into two subsets: a training set and a test set (80% - 20%). I performed a logistic regression and calculated training and testing accuracies and averaged the results over the 100 splits leading to a mean training accuracy and a mean testing accuracy. I was expecting to get an increasing accuracy score as a function of informative features for both the training and the test sets. On the contrary, I have got the best training and test scores for one informative feature. Why do I get these results ? 
Thanks for your help,
Best regards,
Ben

Below the simulation code I have written:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

RANDOM_SEED = 4

n_inf = np.array([1, 5, 10, 15, 20, 25, 30, 35, 40])
mean_training_score_array = np.array([])
mean_testing_score_array = np.array([])
for n_inf_value in n_inf:
    X, y = make_classification(n_samples=2500,
                               n_features=40,
                               n_informative=n_inf_value,
                               n_redundant=0,
                               n_repeated=0,
                               n_classes=2,
                               n_clusters_per_class=1,
                               random_state=RANDOM_SEED,
                               shuffle=False)
    #
    print('Simulated data - number of informative features = ' + str(n_inf_value))
    #
    sss = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=RANDOM_SEED)
    training_score_array = np.array([])
    testing_score_array = np.array([])
    for train_index_split, test_index_split in sss.split(X, y):
        X_split_train, X_split_test = X[train_index_split], X[test_index_split]
        y_split_train, y_split_test = y[train_index_split], y[test_index_split]
        scaler = StandardScaler()
        X_split_train = scaler.fit_transform(X_split_train)
        X_split_test = scaler.transform(X_split_test)
        lr = LogisticRegression(fit_intercept=True, max_iter=1e9, verbose=0,
                                random_state=RANDOM_SEED, solver='lbfgs', tol=1e-6, C=10)
        lr.fit(X_split_train, y_split_train)
        y_pred_train = lr.predict(X_split_train)
        y_pred_test = lr.predict(X_split_test)
        accuracy_train_score = accuracy_score(y_split_train, y_pred_train)
        accuracy_test_score = accuracy_score(y_split_test, y_pred_test)
        training_score_array = np.append(training_score_array, accuracy_train_score)
        testing_score_array = np.append(testing_score_array, accuracy_test_score)
    mean_training_score_array = np.append(mean_training_score_array, np.average(training_score_array))
    mean_testing_score_array = np.append(mean_testing_score_array, np.average(testing_score_array))
#
print('mean_training_score_array=' + str(mean_training_score_array))
print('mean_testing_score_array=' + str(mean_testing_score_array))
#
plt.plot(n_inf, mean_training_score_array, 'r', label='mean training score')
plt.plot(n_inf, mean_testing_score_array, 'g', label='mean testing score')
plt.xlabel('number of informative features out of 40')
plt.ylabel('accuracy')
plt.legend()
plt.show()
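For reference, the split/scale/fit loop above can also be written more compactly with a Pipeline and cross_validate. This is only a sketch that restructures the same experiment (same generator settings, splitter and LogisticRegression parameters as above), not a change to it:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_SEED = 4

for n_inf_value in [1, 5, 10, 15, 20, 25, 30, 35, 40]:
    X, y = make_classification(n_samples=2500, n_features=40,
                               n_informative=n_inf_value, n_redundant=0,
                               n_repeated=0, n_classes=2,
                               n_clusters_per_class=1,
                               random_state=RANDOM_SEED, shuffle=False)
    # The scaler is fitted on each training split only, as in the manual loop.
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(C=10, tol=1e-6, max_iter=int(1e9),
                                           solver='lbfgs',
                                           random_state=RANDOM_SEED))
    cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2,
                                random_state=RANDOM_SEED)
    scores = cross_validate(clf, X, y, cv=cv, return_train_score=True)
    print(n_inf_value,
          scores['train_score'].mean(),
          scores['test_score'].mean())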