From nils106 at googlemail.com Tue Mar 3 02:49:51 2020 From: nils106 at googlemail.com (Nils Wagner) Date: Tue, 3 Mar 2020 08:49:51 +0100 Subject: [scikit-learn] tensorflow and scikit-learn Message-ID: Hi All, I am a newbie to scikit-learn. Is it possible to use scikit-learn instead of tensorflow and keras in the attached script? Best regards, Nils -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part --------------

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import random
import math
import numpy as np

np.random.seed(1)

#
# ModuleNotFoundError: No module named 'tensorflow'
#
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

def Amplitude(omega, zeta):
    """Analytic amplitude calculation"""
    A = 1/math.sqrt((1-omega**2)**2+(2*zeta*omega)**2)
    return A

zeta_0 = 0.1     # Damping ratio
w_min = 0.0      # Start frequency
w_max = 10.0     # End frequency
N_omega = 300    # Number of points per interval

w = np.linspace(w_min, w_max, N_omega).reshape(-1, 1)
Amplitude = np.vectorize(Amplitude)
a = Amplitude(w, zeta_0)

rnd_indices = np.random.rand(len(w)) < 0.80
x_train = w[rnd_indices]
y_train = a[rnd_indices]
x_test = w[~rnd_indices]
y_test = a[~rnd_indices]
print(x_train)
print(x_test)
input('Press enter to continue')

# Create a model
def baseline_model():
    height = 100
    model = Sequential()
    model.add(Dense(height, input_dim=1, activation='tanh', kernel_initializer='uniform'))
    model.add(Dense(height, input_dim=height, activation='tanh', kernel_initializer='uniform'))
    model.add(Dense(height, input_dim=height, activation='tanh', kernel_initializer='uniform'))
    model.add(Dense(1, input_dim=height, activation='linear', kernel_initializer='uniform'))
    sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)
    model.compile(loss='mse', optimizer=sgd)
    return model

# Training the model
model = baseline_model()
model.fit(x_train, y_train, epochs=1000, verbose=0)

plt.figure(figsize=(16,8))
plt.rcParams["font.family"] = "arial"
plt.rcParams["font.size"] = "18"
plt.semilogy(x_test, model.predict(x_test), 'og')
plt.semilogy(x_train, model.predict(x_train), 'r')
plt.semilogy(w, a, 'b')
plt.xlabel('Driving Angular Frequency [Hz]')
plt.ylabel('Amplitude [m]')
plt.title('Oscillator Amplitude vs Driving Angular Frequency')
plt.legend(['TensorFlow Test', 'TensorFlow Training', 'Analytic Solution'])
plt.show()

From niourf at gmail.com Tue Mar 3 07:36:41 2020 From: niourf at gmail.com (Nicolas Hug) Date: Tue, 3 Mar 2020 07:36:41 -0500 Subject: [scikit-learn] tensorflow and scikit-learn In-Reply-To: References: Message-ID: Hi Nils, From a quick glance it looks like you're building a fully connected multi-layer perceptron so yes, this is possible in scikit-learn with the neural_network module (check out the docs). The script would be quite different though, it's not just plug and play. Also, for anything more complex in neural nets, we would not recommend scikit-learn. Nicolas On 3/3/20 2:49 AM, Nils Wagner via scikit-learn wrote: > Hi All, > > I am a newbie to scikit-learn. Is it possible to use scikit-learn > instead of tensorflow and keras in the attached script? > > Best regards, > Nils > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL:
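For reference, a minimal sketch of the scikit-learn variant Nicolas points to (the neural_network module). It reuses the x_train / y_train / x_test arrays built in the attached script; the hyperparameters below are only rough analogues of the Keras model (assumed, not an exact translation), so convergence behaviour will differ.

from sklearn.neural_network import MLPRegressor

# Three hidden tanh layers of width 100, trained with SGD + Nesterov
# momentum, loosely mirroring baseline_model() above.
mlp = MLPRegressor(hidden_layer_sizes=(100, 100, 100),
                   activation='tanh',
                   solver='sgd',
                   learning_rate_init=0.01,
                   momentum=0.9,
                   nesterovs_momentum=True,
                   max_iter=1000,
                   random_state=1)
mlp.fit(x_train, y_train.ravel())   # ravel() because y_train has shape (n, 1)
pred_test = mlp.predict(x_test)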
From adrin.jalali at gmail.com Tue Mar 3 08:19:19 2020 From: adrin.jalali at gmail.com (Adrin) Date: Tue, 3 Mar 2020 14:19:19 +0100 Subject: [scikit-learn] tensorflow and scikit-learn In-Reply-To: References: Message-ID: skorch is another nice library to do DL in sklearn based environments/workflows. On Tue., Mar. 3, 2020, 13:37 Nicolas Hug, wrote: > Hi Nils, > > From a quick glance it looks like you're building a fully connected > multi-layer perceptron so yes, this is possible in scikit-learn with the > neural_network module (check out the docs). The script would be quite > different though, it's not just plug and play. Also, for anything more > complex in neural nets, we would not recommend scikit-learn. > > Nicolas > On 3/3/20 2:49 AM, Nils Wagner via scikit-learn wrote: > > Hi All, > > I am a newbie to scikit-learn. Is it possible to use scikit-learn instead of > tensorflow and keras in the attached script? > > Best regards, > Nils > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Mar 3 17:47:46 2020 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 4 Mar 2020 09:47:46 +1100 Subject: [scikit-learn] distances Message-ID: I noticed a comment by @amueller on Gitter re considering a project on our distances implementations. I think there's a lot of work that can be done in unifying distances implementations... (though I'm not always sure the benefit.) I thought I would summarise some of the issues below, as I was unsure what Andy intended. As @jeremiedbb said, making n_jobs more effective would be beneficial. Reducing duplication between metrics.pairwise and neighbors._dist_metrics and kmeans would be noble (especially with regard to parameters, where scipy.spatial's mahalanobis available through sklearn.metrics does not accept V but sklearn.neighbors does), and perhaps offer higher consistency of results and efficiencies. We also have idioms in the code like "if the metric is euclidean, use squared=True where we only need a ranking, then take the square root" while neighbors metrics abstract this with an API by providing rdist and rdist_to_dist. There are issues about making sure that pairwise_distances(metric='minkowski', p=2) is using the same implementation as pairwise_distances(metric='euclidean'), etc. We have issues with chunking and distributing computations in the case that metric params are derived from the dataset (ideally a training set). #16419 is a simple instance where the metric param is sample-aligned and needs to be chunked up. In other cases, we precompute some metric param over all the data, then pass it to each chunk worker, using _precompute_metric_params introduced in #12672. This is also relevant to #9555. While that initial implementation in #12672 is helpful and aims to maintain backwards compatibility, it makes some dubious choices. Firstly in terms of code structure it is not a very modular approach - each metric is handled with an if-then. Secondly, it *only* handles the chunking case, relying on the fact that these metrics are in scipy.spatial, and have a comparable handling of V=None and VI=None.
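As a concrete illustration of that V handling, the data-derived default can be overridden from the caller's side. A hedged sketch against the public pairwise_distances API (assuming, as in recent releases, that the V keyword for metric='seuclidean' is forwarded to scipy; when V is omitted it is computed from X and Y stacked together, as discussed further below):

import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))          # "train"
Y = rng.normal(size=(50, 3)) * 10.0    # "test" with a very different spread

# V derived implicitly from X and Y together
D_implicit = pairwise_distances(X, Y, metric='seuclidean')

# V fixed explicitly from the training data only
V = X.var(axis=0, ddof=1)
D_explicit = pairwise_distances(X, Y, metric='seuclidean', V=V)

print(np.abs(D_implicit - D_explicit).max())   # not zero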
In the Gower Distances PR (#9555) when implementing a metric locally, rather than relying on scipy.spatial, we needed to provide an implementation of these default parameters both when the data is chunked and when the metric function is called straight out. Thirdly, its approach to training vs test data is dubious. We don't formally label X and Y in pairwise_distances as train/test, and perhaps we should. Maintaining backwards compat with scipy's seuclidean and mahalanobis, our implementation stacks X and Y to each other if both are provided, and then calculates their variance. This means that users may be applying a different metric at train and at test time (if the variance of X as train and Y as test is substantially different), which I consider a silent error. We can either make the train/test nature of X and Y more explicit, or we can require that data-based parameters are given explicitly by the user and not implicitly computed. If I understand correctly, sklearn.neighbors will not compute V or VI for you, and it must be provided explicitly. (Requiring that the scaling of each feature be given explicitly in Gower seems like an unnecessary burden on the user, however.) Then there are issues like whether we should consistently set the diagonal to zero in all metrics where Y=None. In short, there are several projects in distances, and I'd support them being considered for work.... But it's a lot of engineering, if motivated by ML needs and consistency for users. J -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeremie.du-boisberranger at inria.fr Wed Mar 4 13:20:43 2020 From: jeremie.du-boisberranger at inria.fr (Jeremie du Boisberranger) Date: Wed, 4 Mar 2020 19:20:43 +0100 Subject: [scikit-learn] ANN: scikit-learn 0.22.2.post1 In-Reply-To: References: Message-ID: This is a minor release including a few bug fixes. Here is the full changelog: https://scikit-learn.org/stable/whats_new/v0.22.html#version-0-22-2 The 0.22.2.post1 release includes a packaging fix for the source distribution but the content of the packages is otherwise identical to the content of the wheels with the 0.22.2 version (without the .post1 suffix). Thank you very much to all who contributed to this release ! Regards, J?r?mie, on behalf of the scikit-learn maintainer team. From rawtevipula25 at gmail.com Thu Mar 5 10:00:44 2020 From: rawtevipula25 at gmail.com (Vipula Rawte) Date: Thu, 5 Mar 2020 10:00:44 -0500 Subject: [scikit-learn] Getting identical mse, r2, mae for different data Message-ID: I am getting identical metric evaluation values for different data, I printed the matrix shape too. Below is a screenshot: [image: image.png] -- Regards, Vipula Rawte -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 126984 bytes Desc: not available URL: From fhjaime96 at gmail.com Thu Mar 5 10:06:44 2020 From: fhjaime96 at gmail.com (Jaime Ferrando Huertas) Date: Thu, 5 Mar 2020 16:06:44 +0100 Subject: [scikit-learn] Getting identical mse, r2, mae for different data In-Reply-To: References: Message-ID: Can you provide the code that produces this output? El jue., 5 mar. 2020 a las 16:03, Vipula Rawte () escribi?: > I am getting identical metric evaluation values for different data, I > printed the matrix shape too. 
> > Below is a screenshot: > > [image: image.png] > > -- > Regards, > Vipula Rawte > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 126984 bytes Desc: not available URL: From rawtevipula25 at gmail.com Thu Mar 5 12:07:27 2020 From: rawtevipula25 at gmail.com (Vipula Rawte) Date: Thu, 5 Mar 2020 12:07:27 -0500 Subject: [scikit-learn] Getting identical mse, r2, mae for different data In-Reply-To: References: Message-ID: import os import sys import csv import pandas as pd from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error from sklearn.feature_extraction.text import TfidfVectorizer from nltk.tokenize import RegexpTokenizer import re import numpy as np from sklearn.svm import SVR import time from scipy.sparse import csr_matrix from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict from sklearn import metrics import copy from multiscorer import MultiScorer start = time.time() #print("metrics: ", metrics.SCORERS.keys()) mae_file = open('mae_scores.txt', 'w') mse_file = open('mse_scores.txt', 'w') r2_file = open('r2_scores.txt', 'w') def tokenizer(text): if text: result = re.findall('[a-z]{2,}', text.lower()) else: result = [] return result def tfidf_vect(X): vect = TfidfVectorizer(tokenizer=tokenizer, stop_words='english') v = vect.fit(X) X_vect = v.transform(X) return X_vect def compute(X_vect,y): scorer = MultiScorer({ 'r2' : (r2_score , {}), 'mse' : (mean_squared_error, {}), 'mae' : (mean_absolute_error, {}) }) #SVR model model = SVR(C=1.0, epsilon=0.2, kernel= "poly") X_train, X_test, y_train, y_test = train_test_split(X_vect, y, test_size=0.33, shuffle=False, random_state=42) model.fit(X_train, y_train) pred = model.predict(X_test) print("mse: ", mean_squared_error(pred, y_test)) print("mae: ", mean_absolute_error(pred, y_test)) print("r2_score: ", r2_score(pred, y_test)) ''' # Perform 6-fold cross validation scores = cross_val_score(model, X_vect, y, cv=10, scoring=scorer) results = scorer.get_results() print("len: ", X_vect.shape[0]) final_scores = [] for metric_name in results.keys(): average_score = np.average(results[metric_name]) print('%s : %f' % (metric_name, average_score)) final_scores.append(average_score) r2_file.write(str(final_scores[0]) + '\n') mse_file.write(str(final_scores[1]) + '\n') mae_file.write(str(final_scores[2]) + '\n') ''' ''' df_header = ['cik_year', 'words', 'sent_words', 'roa', 'eps', 'tobinq', 'tier1_c', 'leverage', 'Z_score_c'] #10K_t+1 df1 = pd.read_csv("list_10K_next.txt", header=None, usecols=[0], names=['cik_year']) df21 = pd.read_csv("train_2006_2011_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df22 = pd.read_csv("test_2006_2011_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df23 = pd.read_csv("train_2007_2012_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df24 = pd.read_csv("test_2007_2012_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df2 = pd.concat([df21, df22, df23, df24]) df5 = df2.copy() searchfor1 = df1['cik_year'].values.tolist() df2 = df2[df2.cik_year.str.contains('|'.join(searchfor1))].reset_index() del df2['index'] #all_perf_indicators basepath1 
= "/data/ftm/xgb_regr/ch_an_data/bank_all_perf_ind_data/" dp11 = pd.read_csv(basepath1 + "train_2007_2012.csv") dp12 = pd.read_csv(basepath1 + "test_2007_2012.csv") dp1 = pd.concat([dp11, dp12]) searchfor1 = df1['cik_year'].values.tolist() dp1 = dp1[dp1.cik_year.str.contains('|'.join(searchfor1))].reset_index() del dp1['index'] dp1 = dp1.drop_duplicates() df2 = pd.merge(df2, dp1) df2 = df2.drop_duplicates() df2['prev_cik_year'] = df2['cik_year'].apply(lambda x: x.split("_")[0] + "_" + str(int(x.split("_")[1]) - 1)) #8K_t df3 = pd.read_csv("list_8K.txt", header=None, usecols=[0], names=['cik_year']) df41 = pd.read_csv("train_8K_2006_2011_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df42 = pd.read_csv("test_8K_2006_2011_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df43 = pd.read_csv("train_8K_2007_2012_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df44 = pd.read_csv("test_8K_2007_2012_scaled.csv", usecols=['cik_year', 'mda_words', 'mda_sent_words', 'scaled_roa']) df4 = pd.concat([df41, df42, df43, df44]) searchfor1 = df3['cik_year'].values.tolist() df4 = df4[df4.cik_year.str.contains('|'.join(searchfor1))].reset_index() del df4['index'] df4 = pd.merge(df4, df2, left_on='cik_year', right_on='prev_cik_year') df4 = df4.drop_duplicates() df4 = df4.rename({'cik_year_x':'cik_year', 'mda_words_x':'words', 'mda_sent_words_x':'sent_words', 'scaled_roa_y': 'roa', 'eps_scaled': 'eps', 'tobinq_scaled': 'tobinq', 'tier1_c_scaled': 'tier1_c', 'leverage_scaled': 'leverage', 'Z_score_c_scaled': 'Z_score_c'}, axis=1) df4.to_csv("8K_t.csv", columns=df_header) #10K_t searchfor1 = df3['cik_year'].values.tolist() df5 = df5[df5.cik_year.str.contains('|'.join(searchfor1))].reset_index() del df5['index'] df5 = pd.merge(df5, df2, left_on='cik_year', right_on='prev_cik_year') df5 = df5.drop_duplicates() df5 = df5.rename({'cik_year_x':'cik_year', 'mda_words_x':'words', 'mda_sent_words_x':'sent_words', 'scaled_roa_x': 'roa', 'eps_prev_scaled': 'eps', 'tobinq_prev_scaled': 'tobinq', 'tier1_c_prev_scaled': 'tier1_c', 'leverage_prev_scaled': 'leverage', 'Z_score_c_prev_scaled': 'Z_score_c'}, axis=1) df5.to_csv("10K_t.csv", columns=df_header) df2 = df2.rename({'mda_words':'words', 'mda_sent_words':'sent_words', 'scaled_roa': 'roa', 'eps_scaled': 'eps', 'tobinq_scaled': 'tobinq', 'tier1_c_scaled': 'tier1_c', 'leverage_scaled': 'leverage', 'Z_score_c_scaled': 'Z_score_c'}, axis=1) df2.to_csv("10K_t1.csv", columns=df_header) ''' #print("after 8K: ", len(df2), len(df4), len(df5), list(df2), list(df4), list(df5)) ''' df_10K_t1 = pd.read_csv("10K_t1.csv") df_10K_t = pd.read_csv("10K_t.csv") word_type = ['words', 'sent_words'] target = ['roa', 'eps', 'tobinq', 'tier1_c', 'leverage', 'Z_score_c'] for t in target: print(t) print(df_10K_t1[t]) print(df_10K_t[t]) for w in word_type: for t in target: print("w: ", w, "t: ", t) #8K print("8K") df_8K_t = pd.read_csv("8K_t.csv") X_8K = df_8K_t[w].values.astype('U') y_8K = df_8K_t[t] X_vect_8K = tfidf_vect(X_8K) compute(X_vect_8K, y_8K) #10K_t+1 print("10K_t+1") df_10K_t1 = pd.read_csv("10K_t1.csv") X_10K1 = df_10K_t1[w].values.astype('U') y_10K1 = df_10K_t1[t] X_vect_10K1 = tfidf_vect(X_10K1) compute(X_vect_10K1, y_10K1) #10K_t print("10K_t") df_10K_t = pd.read_csv("10K_t.csv") X_10K = df_10K_t[w].values.astype('U') y_10K = df_10K_t[t] X_vect_10K = tfidf_vect(X_10K) #8K+10K (concat) print("8K+10K (concat)") X_vect_concat = csr_matrix(pd.concat([pd.DataFrame(X_vect_8K.todense()), 
pd.DataFrame(X_vect_10K1.todense())], axis=1)) compute(X_vect_concat, y_10K1) #8K+10K (sum) print("#8K+10K (sum)") X_vect_sum = pd.DataFrame(X_vect_8K.todense()).add(pd.DataFrame(X_vect_10K1.todense()), fill_value=0) compute(X_vect_sum, y_10K1) #changes print("#changes") X_vect_diff = pd.DataFrame(X_vect_10K1.todense()).subtract(pd.DataFrame(X_vect_10K.todense()), fill_value=0) compute(X_vect_diff, y_10K1) mae_file.close() mse_file.close() r2_file.close() ''' #df = pd.read_csv("10K_t.csv") #v = df[df.duplicated(['words'], keep=False)] #v = pd.concat(g for _, g in df.groupby("words"))# if len(g) > 1) #print(v) #print(df['words']) w = "words" t = "leverage" #8K print("8K") df_8K_t = pd.read_csv("8K_t.csv") X_8K = df_8K_t[w].values.astype('U') y_8K = df_8K_t[t] X_vect_8K = tfidf_vect(X_8K) compute(X_vect_8K, y_8K) print("8K", type(X_vect_8K), X_vect_8K.shape) #10K_t+1 print("10K_t+1") df_10K_t1 = pd.read_csv("10K_t1.csv") X_10K1 = df_10K_t1[w].values.astype('U') y_10K1 = df_10K_t1[t] X_vect_10K1 = tfidf_vect(X_10K1) compute(X_vect_10K1, y_10K1) print("10K_t1", type(X_vect_10K1), X_vect_10K1.shape) #10K_t print("10K_t") df_10K_t = pd.read_csv("10K_t.csv") X_10K = df_10K_t[w].values.astype('U') y_10K = df_10K_t[t] X_vect_10K = tfidf_vect(X_10K) compute(X_vect_10K, y_10K) print("10K: ", type(X_vect_10K), X_vect_10K.shape) #8K+10K (concat) print("8K+10K (concat)") X_vect_concat = csr_matrix(pd.concat([pd.DataFrame(X_vect_8K.todense()), pd.DataFrame(X_vect_10K1.todense())], axis=1)) compute(X_vect_concat, y_10K1) print("8K +10K concat: ", type(X_vect_concat), X_vect_concat.shape) #8K+10K (sum) print("#8K+10K (sum)") X_vect_sum = pd.DataFrame(X_vect_8K.todense()).add(pd.DataFrame(X_vect_10K1.todense()), fill_value=0) compute(X_vect_sum, y_10K1) print("8K + 10K sum: ", type(X_vect_sum), X_vect_sum.shape) #changes print("#changes") X_vect_diff = pd.DataFrame(X_vect_10K1.todense()).subtract(pd.DataFrame(X_vect_10K.todense()), fill_value=0) compute(X_vect_diff, y_10K1) print("changes: ", type(X_vect_diff), X_vect_diff.shape) print((X_vect_10K1.todense()==X_vect_diff.todense())) print("Total execution time: ", time.time() - start) On Thu, Mar 5, 2020 at 10:08 AM Jaime Ferrando Huertas wrote: > Can you provide the code that produces this output? > > El jue., 5 mar. 2020 a las 16:03, Vipula Rawte () > escribi?: > >> I am getting identical metric evaluation values for different data, I >> printed the matrix shape too. >> >> Below is a screenshot: >> >> [image: image.png] >> >> -- >> Regards, >> Vipula Rawte >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Regards, Vipula Rawte -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 126984 bytes Desc: not available URL: From rawtevipula25 at gmail.com Thu Mar 5 12:09:55 2020 From: rawtevipula25 at gmail.com (Vipula Rawte) Date: Thu, 5 Mar 2020 12:09:55 -0500 Subject: [scikit-learn] Getting identical mse, r2, mae for different data In-Reply-To: References: Message-ID: On Thu, Mar 5, 2020 at 10:08 AM Jaime Ferrando Huertas wrote: > Can you provide the code that produces this output? > > El jue., 5 mar. 
2020 a las 16:03, Vipula Rawte () > escribi?: > >> I am getting identical metric evaluation values for different data, I >> printed the matrix shape too. >> >> Below is a screenshot: >> >> [image: image.png] >> >> -- >> Regards, >> Vipula Rawte >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Regards, Vipula Rawte -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 126984 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: refine_8K_perf_ind_prac.py Type: text/x-python Size: 6791 bytes Desc: not available URL: From t3kcit at gmail.com Thu Mar 5 16:12:43 2020 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 5 Mar 2020 16:12:43 -0500 Subject: [scikit-learn] distances In-Reply-To: References: Message-ID: Thanks for a great summary of issues! I agree there's lots to do, though I think most of the issues that you list are quite hard and require thinking about API pretty hard. So they might not be super amendable to being solved by a shorter-term project. I was hoping there would be some more easy wins that we could get by exploiting OpenMP better (or at all) in the distances. Not sure if there is, though. I wonder if having a multicore implementation of euclidean_distances would be useful for us, or if that's going too low-level. On 3/3/20 5:47 PM, Joel Nothman wrote: > I noticed a comment by?@amueller on Gitter re?considering a project on > our distances implementations. > > I think there's a lot of work that can be done in unifying distances > implementations... (though I'm not always sure the benefit.) I thought > I would?summarise some of the issues below, as I was unsure what Andy > intended. > > As @jeremiedbb said, making n_jobs more effective would be beneficial. > Reducing duplication between metrics.pairwise and > neighbors._dist_metrics and kmeans would?be noble (especially with > regard to parameters, where scicpy.spatial's mahalanobis available > through sklearn.metrics does not accept V but sklearn.neighbors does). > and perhaps offer higher consistency of results and efficiencies. > > We also have idioms the code like "if the metric is euclidean, use > squared=True where we only need a ranking, then take the squareroot" > while neighbors metrics abstract this with an API by providing rdsit > and rdist_to_dist. > > There are issues about making sure that > pairwise_distances(metric='minkowski', p=2) is using the same > implementation as pairwise_distances(metric='euclidean'), etc. > > We have issues with chunking and distributing computations in the case > that metric params are derived from the dataset (ideally a training?set). > > #16419 is a simple instance where the metric param is sample-aligned > and needs to be chunked up. > > In other cases, we precompute some metric param over all the data, > then pass it to each chunk worker, using _precompute_metric_params > introduced in #12672. This is also relevant to #9555. > > While that initial implementation in #12672 is helpful and aims to > maintain backwards compatibility, it makes some dubious choices. 
> > Firstly in terms of code structure it is not a very modular approach - > each metric is handled with an if-then. Secondly, it *only* handles > the chunking case, relying on the fact that these metrics are in > scipy.spatial, and have a comparable handling of V=None and VI=None. > In the Gower Distances PR (#9555) when implementing a metric locally, > rather than relying on scipy.spatial, we needed to provide an > implementation of these default parameters both when the data is > chunked and when the metric function is called straight out. > > Thirdly, its approach to training vs test data is dubious. We don't > formally label X and Y in pairwise_distances as train/test, and > perhaps we should. Maintaining backwards compat with scipy's > seuclidean and mahalanobis, our implementation stacks X and Y to each > other if both are provided, and then calculates their variance. This > means that users may be applying a different metric at train and at > test time (if the variance of X as train and Y as test is > substantially different), which I consider a silent error. We can > either make the train/test nature of X and Y more explicit, or we can > require that data-based parameters are given explicitly by the user > and not implicitly computed. If I understand correctly, > sklearn.neighbors will not compute V or VI for you, and it must be > provided explicitly. (Requiring that the scaling of each feature be > given explicitly in Gower seems like an unnecessary burden on the > user, however.) > > Then there are issues like whether we should consistently set the > diagonal to zero in all metrics where Y=None. > > In short, there are several projects in distances, and I'd support > them being considered for work.... But it's a lot of engineering, if > motivated by ML needs and consistency for users. > > J > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeremie.du-boisberranger at inria.fr Fri Mar 6 05:00:45 2020 From: jeremie.du-boisberranger at inria.fr (Jeremie du Boisberranger) Date: Fri, 6 Mar 2020 11:00:45 +0100 Subject: [scikit-learn] distances In-Reply-To: References: Message-ID: <30c505cf-c178-b81c-6aa5-bf047baeaede@inria.fr> Although pairwise distances are very good candidates for OpenMP based multi-threading due to their embarrassingly parallel nature, I think euclidean distances (from the pairwise module) is the one which will less benefit from that. It's implementation, using the dot trick, uses BLAS level 3 routine (matrix matrix multiplication) which will always be better optimized, better parallelized, have runtime cpu detection. Side note: What really makes KMeans faster is not the fact that euclidean distances are computed by chunks, it's because the chunked pairwise distance matrix fits in cache so it stays there for the following operations on this matrix (finding labels, partially update centers). So it does not apply to only computing euclidean distances. On the other hand, other metrics don't all have internal multi-threading, and probably none rely on level 3 BLAS routines. Usually computing pairwise distances does not involve a lot of computations and is quite fast, so parallelizing them with joblib has no benefit due to the joblib overhead being bigger than the computations themselves. 
Unless the data is big enough but memory issues will happen before that :) Those metrics could probably benefit from OpenMP based multithreading. About going too low-level, we already have this DistanceMetric module implementing all metrics in cython, so I'd say we're already kind of low-level and in that case, using OpenMP would really just be adding a 'p' before 'range' :) I think a good first step could be to move this module in metrics, where it really belongs, rework it to make it fused typed and sparse friendly, and add some prange. Obviously it will keep most of the API flaws that @jnothman exposed but it might set up a cleaner ground for future API changes. In the end, whatever you choose, I'd be happy to help. J?r?mie (@jeremiedbb) On 05/03/2020 22:12, Andreas Mueller wrote: > Thanks for a great summary of issues! > I agree there's lots to do, though I think most of the issues that you > list are quite hard and require thinking about API pretty hard. > So they might not be super amendable to being solved by a shorter-term > project. > > I was hoping there would be some more easy wins that we could get by > exploiting OpenMP better (or at all) in the distances. > Not sure if there is, though. > > I wonder if having a multicore implementation of euclidean_distances > would be useful for us, or if that's going too low-level. > > > > On 3/3/20 5:47 PM, Joel Nothman wrote: >> I noticed a comment by?@amueller on Gitter re?considering a project >> on our distances implementations. >> >> I think there's a lot of work that can be done in unifying distances >> implementations... (though I'm not always sure the benefit.) I >> thought I would?summarise some of the issues below, as I was unsure >> what Andy intended. >> >> As @jeremiedbb said, making n_jobs more effective would be >> beneficial. Reducing duplication between metrics.pairwise and >> neighbors._dist_metrics and kmeans would?be noble (especially with >> regard to parameters, where scicpy.spatial's mahalanobis available >> through sklearn.metrics does not accept V but sklearn.neighbors >> does). and perhaps offer higher consistency of results and efficiencies. >> >> We also have idioms the code like "if the metric is euclidean, use >> squared=True where we only need a ranking, then take the squareroot" >> while neighbors metrics abstract this with an API by providing rdsit >> and rdist_to_dist. >> >> There are issues about making sure that >> pairwise_distances(metric='minkowski', p=2) is using the same >> implementation as pairwise_distances(metric='euclidean'), etc. >> >> We have issues with chunking and distributing computations in the >> case that metric params are derived from the dataset (ideally a >> training?set). >> >> #16419 is a simple instance where the metric param is sample-aligned >> and needs to be chunked up. >> >> In other cases, we precompute some metric param over all the data, >> then pass it to each chunk worker, using _precompute_metric_params >> introduced in #12672. This is also relevant to #9555. >> >> While that initial implementation in #12672 is helpful and aims to >> maintain backwards compatibility, it makes some dubious choices. >> >> Firstly in terms of code structure it is not a very modular approach >> - each metric is handled with an if-then. Secondly, it *only* handles >> the chunking case, relying on the fact that these metrics are in >> scipy.spatial, and have a comparable handling of V=None and VI=None. 
>> In the Gower Distances PR (#9555) when implementing a metric locally, >> rather than relying on scipy.spatial, we needed to provide an >> implementation of these default parameters both when the data is >> chunked and when the metric function is called straight out. >> >> Thirdly, its approach to training vs test data is dubious. We don't >> formally label X and Y in pairwise_distances as train/test, and >> perhaps we should. Maintaining backwards compat with scipy's >> seuclidean and mahalanobis, our implementation stacks X and Y to each >> other if both are provided, and then calculates their variance. This >> means that users may be applying a different metric at train and at >> test time (if the variance of X as train and Y as test is >> substantially different), which I consider a silent error. We can >> either make the train/test nature of X and Y more explicit, or we can >> require that data-based parameters are given explicitly by the user >> and not implicitly computed. If I understand correctly, >> sklearn.neighbors will not compute V or VI for you, and it must be >> provided explicitly. (Requiring that the scaling of each feature be >> given explicitly in Gower seems like an unnecessary burden on the >> user, however.) >> >> Then there are issues like whether we should consistently set the >> diagonal to zero in all metrics where Y=None. >> >> In short, there are several projects in distances, and I'd support >> them being considered for work.... But it's a lot of engineering, if >> motivated by ML needs and consistency for users. >> >> J >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From adityaselfefficient at gmail.com Wed Mar 11 01:10:10 2020 From: adityaselfefficient at gmail.com (aditya aggarwal) Date: Wed, 11 Mar 2020 10:40:10 +0530 Subject: [scikit-learn] Understanding max_features parameter in RandomForestClassifier Message-ID: For RandomForestClassifier in sklearn max_features parameter gives the max no of features for split in random forest which is sqrt(n_features) as default. If m is sqrt of n, then no of combinations for DT formation is nCm. What if nCm is less than n_estimators (no of decision trees in random forest)? *example:* For n = 7, max_features is 3, so nCm is 35, meaning 35 unique combinations of features for decision trees. Now for n_estimators = 100, will the remaining 65 trees have repeated combination of features? If so, won't trees be correlated introducing bias in the answer? Thanks Aditya Aggarwal -------------- next part -------------- An HTML attachment was scrubbed... URL: From adityaselfefficient at gmail.com Wed Mar 11 01:22:22 2020 From: adityaselfefficient at gmail.com (aditya aggarwal) Date: Wed, 11 Mar 2020 10:52:22 +0530 Subject: [scikit-learn] Threshold for roc_curve in binary classification Message-ID: Hello I was going through the logic to calculate threshold to plot roc_curve. As far as I could understand, fps, tps and threshold is calculated in slklearn.metrics._binary_clf_curve . How are multiple values of threshold calculated for binary classification? Also what is happening in the following lines? 
distinct_value_indices = np.where(np.diff(y_score))[0] threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1] Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Wed Mar 11 01:26:50 2020 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Wed, 11 Mar 2020 14:26:50 +0900 Subject: [scikit-learn] Understanding max_features parameter in RandomForestClassifier In-Reply-To: References: Message-ID: Regardless of the number of features, each DT estimator is given only a subset of the data. Each DT estimator then uses the features to derive decision rules for the samples it was given. With more trees and few examples, you might get similar or identical trees, but that is not the norm. Pardon brevity. J.B. 2020?3?11?(?) 14:11 aditya aggarwal : > For RandomForestClassifier in sklearn > > max_features parameter gives the max no of features for split in random > forest which is sqrt(n_features) as default. If m is sqrt of n, then no of > combinations for DT formation is nCm. What if nCm is less than n_estimators > (no of decision trees in random forest)? > > *example:* For n = 7, max_features is 3, so nCm is 35, meaning 35 unique > combinations of features for decision trees. Now for n_estimators = 100, > will the remaining 65 trees have repeated combination of features? If so, > won't trees be correlated introducing bias in the answer? > > > Thanks > > Aditya Aggarwal > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From adityaselfefficient at gmail.com Wed Mar 11 01:43:02 2020 From: adityaselfefficient at gmail.com (aditya aggarwal) Date: Wed, 11 Mar 2020 11:13:02 +0530 Subject: [scikit-learn] Understanding max_features parameter in RandomForestClassifier In-Reply-To: References: Message-ID: With all the parameters set to default, (especially bootstrap and max_samples), no of samples passed to each estimator is X.shape[0]. Doesn't it account for all the instances in the dataset with calculated no. of feature? Then how come only a subset is given to the estimator? On Wed, Mar 11, 2020 at 10:58 AM Brown J.B. via scikit-learn < scikit-learn at python.org> wrote: > Regardless of the number of features, each DT estimator is given only a > subset of the data. > Each DT estimator then uses the features to derive decision rules for the > samples it was given. > With more trees and few examples, you might get similar or identical > trees, but that is not the norm. > > Pardon brevity. > J.B. > > 2020?3?11?(?) 14:11 aditya aggarwal : > >> For RandomForestClassifier in sklearn >> >> max_features parameter gives the max no of features for split in random >> forest which is sqrt(n_features) as default. If m is sqrt of n, then no of >> combinations for DT formation is nCm. What if nCm is less than n_estimators >> (no of decision trees in random forest)? >> >> *example:* For n = 7, max_features is 3, so nCm is 35, meaning 35 unique >> combinations of features for decision trees. Now for n_estimators = 100, >> will the remaining 65 trees have repeated combination of features? If so, >> won't trees be correlated introducing bias in the answer? 
>> >> >> Thanks >> >> Aditya Aggarwal >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From venky.yuvy at gmail.com Wed Mar 11 04:18:27 2020 From: venky.yuvy at gmail.com (Venkatachalam N) Date: Wed, 11 Mar 2020 13:48:27 +0530 Subject: [scikit-learn] Understanding max_features parameter in RandomForestClassifier In-Reply-To: References: Message-ID: Hi Aditya, The sampling is done with replacement with the default settings. Hence, you will get different dataset even though you sample same number (`X.shape[0]`) of datapoints. Regards, Venkatachalam N. On Wed, Mar 11, 2020 at 11:14 AM aditya aggarwal < adityaselfefficient at gmail.com> wrote: > With all the parameters set to default, (especially bootstrap and > max_samples), no of samples passed to each estimator is X.shape[0]. Doesn't > it account for all the instances in the dataset with calculated no. of > feature? Then how come only a subset is given to the estimator? > > On Wed, Mar 11, 2020 at 10:58 AM Brown J.B. via scikit-learn < > scikit-learn at python.org> wrote: > >> Regardless of the number of features, each DT estimator is given only a >> subset of the data. >> Each DT estimator then uses the features to derive decision rules for the >> samples it was given. >> With more trees and few examples, you might get similar or identical >> trees, but that is not the norm. >> >> Pardon brevity. >> J.B. >> >> 2020?3?11?(?) 14:11 aditya aggarwal : >> >>> For RandomForestClassifier in sklearn >>> >>> max_features parameter gives the max no of features for split in random >>> forest which is sqrt(n_features) as default. If m is sqrt of n, then no of >>> combinations for DT formation is nCm. What if nCm is less than n_estimators >>> (no of decision trees in random forest)? >>> >>> *example:* For n = 7, max_features is 3, so nCm is 35, meaning 35 >>> unique combinations of features for decision trees. Now for n_estimators = >>> 100, will the remaining 65 trees have repeated combination of features? If >>> so, won't trees be correlated introducing bias in the answer? >>> >>> >>> Thanks >>> >>> Aditya Aggarwal >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
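A small NumPy illustration of that point (a sketch of the idea, not the forest internals verbatim): drawing X.shape[0] row indices with replacement gives every tree a different bootstrap sample, and the max_features subset is re-drawn at every split of every tree rather than fixed once per tree, so n_estimators is not limited by the number of distinct feature combinations.

import numpy as np

rng = np.random.RandomState(0)
n_samples = 1000
for tree_idx in range(3):
    idx = rng.randint(0, n_samples, n_samples)   # bootstrap draw, with replacement
    print(tree_idx, "distinct rows:", np.unique(idx).size)
# Each draw keeps only about 63% of the rows as distinct (1 - 1/e), so the
# trees see different data even though the sample size equals X.shape[0].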
URL: From gianmarcofucci94 at gmail.com Mon Mar 16 05:42:44 2020 From: gianmarcofucci94 at gmail.com (Gianmarco Fucci) Date: Mon, 16 Mar 2020 10:42:44 +0100 Subject: [scikit-learn] Study on annotation of design and implementation choices, and of technical debt Message-ID: Dear all, As software engineering research teams at the University of Sannio (Italy) and Eindhoven University of Technology (The Netherlands) we are interested in investigating the protocol used by developers while they have to annotate implementation and design choices during their normal development activities. More specifically, we are looking at whether, where and what kind of annotations developers usually use trying to be focused more on those annotations mainly aimed at highlighting that the code is not in the right shape (e.g., comments for annotating delayed or intended work activities such as TODO, FIXME, hack, workaround, etc). In the latter case, we are looking at what is the content of the above annotations, as well as how they usually behave while evolving the code that has been previously annotated. When answering the survey, in case your annotation practices are different in different open source projects you may contribute, please refer to how you behave for the projects where you have been contacted. Filling out the survey will take about 5 minutes. Please note that your identity and personal data will not be disclosed, while we plan to use the aggregated results and anonymized responses as part of a scientific publication. If you have any questions about the questionnaire or our research, please do not hesitate to contact us. You can find the survey link here: https://forms.gle/NxdVXiZQSmQ15U4T8 Thanks and regards, Gianmarco Fucci (gianmarcofucci94 at gmail.com) Fiorella Zampetti (fzampetti at unisannio.it) Alexander Serebrenik (a.serebrenik at tue.nl) Massimiliano Di Penta (dipenta at unisannio.it) -------------- next part -------------- An HTML attachment was scrubbed... URL: From nelle.varoquaux at gmail.com Tue Mar 17 11:37:11 2020 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Tue, 17 Mar 2020 16:37:11 +0100 Subject: [scikit-learn] Announcing the 2020 John Hunter Excellence in Plotting Contest Message-ID: Dear all, I apologize for the cross-posting. In memory of John Hunter, we are pleased to announce the John Hunter Excellence in Plotting Contest for 2020. This open competition aims to highlight the importance of data visualization to scientific progress and showcase the capabilities of open source software. Participants are invited to submit scientific plots to be judged by a panel. The winning entries will be announced and displayed at SciPy 2020 or announced in the John Hunter Excellence in Plotting Contest website and youtube channel. John Hunter?s family are graciously sponsoring cash prizes for the winners in the following amounts: - 1st prize: $1000 - 2nd prize: $750 - 3rd prize: $500 - Entries must be submitted by June 1st to the form at https://forms.gle/SrexmkDwiAmDc7ej7 - Winners will be announced at Scipy 2020 in Austin, TX or publicly on the John Hunter Excellence in Plotting Contest website and youtube channel - Participants do not need to attend the Scipy conference. - Entries may take the definition of ?visualization? rather broadly. Entries may be, for example, a traditional printed plot, an interactive visualization for the web, a dashboard, or an animation. 
- Source code for the plot must be provided, in the form of Python code and/or a Jupyter notebook, along with a rendering of the plot in a widely used format. The rendering may be, for example, PDF for print, standalone HTML and Javascript for an interactive plot, or MPEG-4 for a video. If the original data can not be shared for reasons of size or licensing, "fake" data may be substituted, along with an image of the plot using real data. - Each entry must include a 300-500 word abstract describing the plot and its importance for a general scientific audience. - Entries will be judged on their clarity, innovation and aesthetics, but most importantly for their effectiveness in communicating a real-world problem. Entrants are encouraged to submit plots that were used during the course of research or work, rather than merely being hypothetical. - SciPy and the John Hunter Excellence in Plotting Contest organizers reserves the right to display any and all entries, whether prize-winning or not, at the conference, use in any materials or on its website, with attribution to the original author(s). - Past entries can be found at https://jhepc.github.io/ - Questions regarding the contest can be sent to jhepc.organizers at gmail.com John Hunter Excellence in Plotting Contest Co-Chairs Madicken Munk Nelle Varoquaux -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbc.develop at gmail.com Wed Mar 18 17:42:18 2020 From: jbc.develop at gmail.com (Juan BC) Date: Wed, 18 Mar 2020 18:42:18 -0300 Subject: [scikit-learn] The Coronavirus Tech Handbook Message-ID: Sorry for the offtopic https://coronavirustechhandbook.com/ <<<< The Coronavirus Tech Handbook provides a space for technologists, specialists, civic organisations and public & private institutions to collaborate on a rapid and sophisticated response to the coronavirus outbreak. It is a dynamic resource with many hundreds of contributors that is evolving very quickly. -- Juan B Cabral -------------- next part -------------- An HTML attachment was scrubbed... URL: From gk68118 at gmail.com Thu Mar 19 02:11:49 2020 From: gk68118 at gmail.com (Praneet Singh) Date: Thu, 19 Mar 2020 11:41:49 +0530 Subject: [scikit-learn] transfer learning doubt Message-ID: I am training a SGD Classifier with some training dataset which is temporary and will be lost after sometime. So I am planning to save the model in pickle file and reuse it and train again with some another dataset that arrives. But It forgets the previously learned data. As far as I researched in google, tensorflow model allows transfer learning and not forgetting the previous learning but is there any other way with sklearn model to achieve this?? any help would be appreciated -------------- next part -------------- An HTML attachment was scrubbed... URL: From fad469 at uregina.ca Thu Mar 19 09:19:38 2020 From: fad469 at uregina.ca (Farzana Anowar) Date: Thu, 19 Mar 2020 07:19:38 -0600 Subject: [scikit-learn] transfer learning doubt In-Reply-To: References: Message-ID: On 2020-03-19 00:11, Praneet Singh wrote: > I am training a SGD Classifier with some training dataset which is > temporary and will be lost after sometime. So I am planning to save > the model in pickle file and reuse it and train again with some > another dataset that arrives. But It forgets the previously learned > data. 
> > As far as I researched in google, tensorflow model allows transfer > learning and not forgetting the previous learning but is there any > other way with sklearn model to achieve this?? > any help would be appreciated > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn Did you use incremental estimator and partial _fit? If not, try to use them. Should work. Another option is to us deep learning and store the weights for the first model and initialize the second model with that weight and keep doing it for the rest of the models. -- Best Regards, Farzana Anowar, PhD Candidate Department of Computer Science University of Regina From rth.yurchak at gmail.com Thu Mar 19 10:06:37 2020 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Thu, 19 Mar 2020 15:06:37 +0100 Subject: [scikit-learn] transfer learning doubt In-Reply-To: References: Message-ID: <6ac57655-1ac5-34ea-416b-6c65d641ba7b@gmail.com> On 19/03/2020 14:19, Farzana Anowar wrote: > Another option is to us deep learning and store the weights for the first model and initialize the second model with that weight and keep doing it for the rest of the models. This can also be done in scikit-learn with models that support warm_start=True init parameter (including SGDClassifier). Roman From krallinger.martin at gmail.com Thu Mar 19 12:21:36 2020 From: krallinger.martin at gmail.com (Martin Krallinger) Date: Thu, 19 Mar 2020 17:21:36 +0100 Subject: [scikit-learn] Final CFP CodiEsp: Clinical Case Coding Task (eHealth CLEF 2020) In-Reply-To: References: Message-ID: *** Call for Participation CodiEsp: Clinical Case Coding Task (eHealth CLEF 2020) *** * *CodiEsp (eHealth CLEF? Multilingual Information Extraction) Shared Task on automatic assignment of ICD10 codes (procedures, diagnosis) track at CLEF 2020* http://temu.bsc.es/codiesp Plan TL Award for the CodiEsp Track The CodiEsp sub-tracks: *1.CodiEsp Diagnosis Coding *sub-task* (CodiEsp-D)*: will require automatic ICD10-CM [CIE10 Diagn?stico] code assignment to each clinical case document. *2.CodiEsp Procedure Coding *sub-task* (CodiEsp-P):* will require automatic ICD10-PCS [CIE10 Procedimiento] code assignment to each clinical case document. *3.CodiEsp Explainable AI *exploratory sub-task* (CodiEsp-X).* Systems are required to extract the evidence text supporting the predicted codes (both ICD10-CM and ICD10-PCS). *Task description* Clinical coding essentially requires the transformation (or classification) of medical texts into a structured or coded format using internationally recognized class codes. These codes describe a patient?s diagnosis or treatment. Clinical coding is critical for standardizing electronic clinical records; enable aetiology studies, monitor health trends, carry out epidemiology studies, clinical and biomedical research, assist clinical decision-making or even reimbursement. As part of the eHealth CLEF (http://clef-ehealth.org) Multilingual Information Extraction Shared Task we organize* CodiEsp: Clinical Case Coding Task (http://temu.bsc.es/codiesp ). *The CodiEsp task will address the automatic extraction and assignment of clinical coding (diagnosis and procedures) to clinical case documents in Spanish. To enable participation of researches around the world, in addition to the basic data in Spanish, we will also publish versions of the training, development, and test set *automatically translated into English*. 
Participating systems will be asked to automatically assign ICD10 codes (or CIE-10, in Spanish) to clinical case documents. Evaluation is done through comparison to manually assigned ICD10 codes. *Publications and workshop* As in previous eHealth CLEF efforts, there will be an *evaluation workshop allocated at CLEF 2020* where participating teams can present their systems and results. Moreover, participating teams will be invited to submit their system description papers for publication at the *CLEF 2020 Working Notes proceedings*. For previous working notes see: http://ceur-ws.org/Vol-2125/ *CodiEsp awards* There will be three awards for the top-scoring teams promoted by the Spanish Plan for the Advancement of Language Technology (Plan TL) and the Barcelona Supercomputing Center (BSC). -------------------------------------- *Participation and useful info* -------------------------------------- 1. CodiEsp web, info & detailed description: http://temu.bsc.es/codiesp/ 2. Registration for CodiEsp (Multilingual Information Extraction eHealth track): http://temu.bsc.es/codiesp/index.php/registration/ 3. Datasets: https://zenodo.org/record/3693570 4. Additional training resources: https://doi.org/10.5281/zenodo.3606662 ------------------------ *Main CodiEsp Track organizers* ------------------------ - *Martin Krallinger*, Barcelona Supercomputing Center. - *Antonio Miranda*, Barcelona Supercomputing Center. - *Aitor Gonzalez-Agirre*, Barcelona Supercomputing Center. - *Marta Villegas*, Barcelona Supercomputing Center. - *Jordi Armengol*, Barcelona Supercomputing Center. ------------------------ *Important Dates* ------------------------ Jan 13: Training and development set release March 2: Test and background set release May 3: End of evaluation May 5: Results notified May 24: Paper submission Jun 28: Camera-ready paper submission Sep 22-25: CLEF 2020 Conference (Thessaloniki, Greece) -------------- next part -------------- An HTML attachment was scrubbed... URL: From MC_George123 at hotmail.com Wed Mar 25 22:16:03 2020 From: MC_George123 at hotmail.com (MC_George123 at hotmail.com) Date: Thu, 26 Mar 2020 02:16:03 +0000 Subject: [scikit-learn] A basic question about kmeans algorithms elkan and lloyd Message-ID: Hi admins, My team is working on optimization of scikit-learn stuff now. When it comes to kmeans, I find there are two algorithms, one of which is lloyd and the other is elkan, which is an optimized version of lloyd using the triangle inequality. In older versions of scikit-learn, elkan only supports dense datasets, not sparse ones. And in the latest version, elkan supports both types of datasets. So the question is why both algorithms are kept in kmeans, since they do almost the same thing and elkan is an optimized version of lloyd. Are there any precision differences between the two algorithms, and how can I decide which algorithm to use? Best regards, George Fan -------------- next part -------------- An HTML attachment was scrubbed...
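A quick way to look at the precision part of that question empirically (a hedged sketch on synthetic blobs; note that in the 0.22-era API the Lloyd variant is selected with algorithm='full'): both settings minimise the same objective, so with the same initialisation the labels and inertia_ should agree up to floating-point round-off, and the practical difference is speed and memory rather than accuracy.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20000, n_features=10, centers=50, random_state=0)

# "full" is the plain Lloyd iteration; "elkan" adds triangle-inequality bounds.
for algo in ("full", "elkan"):
    km = KMeans(n_clusters=50, algorithm=algo, n_init=1, random_state=0).fit(X)
    print(algo, km.n_iter_, km.inertia_)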
URL: From alexandre.gramfort at inria.fr Thu Mar 26 03:40:15 2020 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Thu, 26 Mar 2020 08:40:15 +0100 Subject: [scikit-learn] A basic question about kmeans algorithms elkan and llyod In-Reply-To: References: Message-ID: hi, I suspect Elkan is really winning when you have many centroids so the conclusion is not systematic my 2c Alex On Thu, Mar 26, 2020 at 3:18 AM MC_George123 at hotmail.com < MC_George123 at hotmail.com> wrote: > Hi admins, > > > > My team is working on optimization on scikit-learn staff now. When it > comes to kmeans, I find there are two algorithms, one of which is lloyd and > the other is elkan, which is the optimized one for lloyd using triangle > inequality. In the older version of scikit-learn, elkan only supports > dense dataset instead of sparse one. And in the latest version, elkan > supports both type of datasets. So there is a question why both two > algorithms are kept in kmeans since they do the almost same thing and elkan > is a optimized one for lloyd. Are there any precision difference between > two algorithms and how can I decide what algorithm to use? > > > > Best regards, > > George Fan > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From niourf at gmail.com Thu Mar 26 15:59:25 2020 From: niourf at gmail.com (Nicolas Hug) Date: Thu, 26 Mar 2020 15:59:25 -0400 Subject: [scikit-learn] Monthly meetings Message-ID: <080445a5-1230-26c2-b582-03c760d1f80e@gmail.com> Hi all, The next scikit-learn monthly meeting will take place on Monday (https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=3&day=30&hour=11&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195 ) While these meetings are mainly for core-devs to discuss the current topics, we're also happy to welcome non-core devs and other projects maintainers! Feel free to join. *Location:* Join Zoom Meeting https://anaconda.zoom.us/j/947129165?pwd=dEFZNHM0ZFBiQWlDYlJlRW1EaHg2QT09 Meeting ID: 947 129 165 Password: 586745 Thanks, Nicolas -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Mar 27 12:32:39 2020 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 27 Mar 2020 12:32:39 -0400 Subject: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team Message-ID: <60bf6211-18f9-7408-03da-a5157c754145@gmail.com> Hey all. There's a pretty cool paper by a team at MS that analyses public github repos for their use of the sklearn and related libraries: https://arxiv.org/abs/1912.09536 Thought it might be of interest. Cheers, Andy From t3kcit at gmail.com Fri Mar 27 12:36:52 2020 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 27 Mar 2020 12:36:52 -0400 Subject: [scikit-learn] A basic question about kmeans algorithms elkan and llyod In-Reply-To: References: Message-ID: There's an interesting analysis in this paper: Fast K-Means with Accurate Bounds http://proceedings.mlr.press/v48/newling16.pdf On 3/26/20 3:40 AM, Alexandre Gramfort wrote: > hi, > > I suspect Elkan is really winning when you have many centroids > so the conclusion is not systematic > > my 2c > Alex > > > On Thu, Mar 26, 2020 at 3:18 AM MC_George123 at hotmail.com > > wrote: > > Hi admins, > > My team is working on optimization on scikit-learn staff now. 
> When it comes to kmeans, I find there are two algorithms, one of which
> is lloyd and the other is elkan, which is the optimized one for
> lloyd using triangle inequality. In the older version of
> scikit-learn, elkan only supports dense dataset instead of sparse
> one. And in the latest version, elkan supports both type of
> datasets. So there is a question why both two algorithms are kept
> in kmeans since they do the almost same thing and elkan is a
> optimized one for lloyd. Are there any precision difference
> between two algorithms and how can I decide what algorithm to use?
>
> Best regards,
>
> George Fan
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From rth.yurchak at gmail.com  Fri Mar 27 13:10:28 2020
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Fri, 27 Mar 2020 18:10:28 +0100
Subject: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team
In-Reply-To: <60bf6211-18f9-7408-03da-a5157c754145@gmail.com>
References: <60bf6211-18f9-7408-03da-a5157c754145@gmail.com>
Message-ID: 

Very interesting! A few comments,

> From GH17, we managed to extract only 10.5k pipelines. The relatively low
> frequency (with respect to the number of notebooks using SCIKIT-LEARN [..])
> indicates a non-wide adoption of this specification. However, the number of
> pipelines in the GH19 corpus is 132k pipelines (i.e., an increase of 13× [..]
> since 2017).

It's nice to see that pipelines are indeed widely used.

> Top-5 transformers [from imports] in GH19 are StandardScaler, CountVectorizer,
> TfidfTransformer, PolynomialFeatures, TfidfVectorizer (in this order). Same are
> the results for GH17 with the difference that PCA is instead of TfidfVectorizer.

Hmm, I would have expected OneHotEncoder somewhere at the top and much less text processing. If there is real usage of CountVectorizer and TfidfTransformer separately, then maybe deprecating TfidfVectorizer could be done https://github.com/scikit-learn/scikit-learn/issues/14951

Though this ranking looks quite unexpected. I wonder if they have the full list and not just the top5.

> Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression,
> MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this order).

Maybe LinearRegression docstring should more strongly suggest to use Ridge with small regularization in practice.

-- 
Roman

On 27/03/2020 17:32, Andreas Mueller wrote:
> Hey all.
> There's a pretty cool paper by a team at MS that analyses public github
> repos for their use of the sklearn and related libraries:
> https://arxiv.org/abs/1912.09536
>
> Thought it might be of interest.
>
> Cheers,
> Andy
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From gael.varoquaux at normalesup.org  Fri Mar 27 18:20:17 2020
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 27 Mar 2020 23:20:17 +0100
Subject: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team
In-Reply-To: 
References: <60bf6211-18f9-7408-03da-a5157c754145@gmail.com>
Message-ID: <20200327222017.fv7jgxrbulntgmbm@phare.normalesup.org>

Thanks for the link Andy. This is indeed very interesting!

On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
> > Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression,
> > MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this
> > order).

> Maybe LinearRegression docstring should more strongly suggest to use Ridge
> with small regularization in practice.

Yes! I actually wonder if we should not remove LinearRegression. It frightens me a bit that so many people use it. The only time that I've seen it used in a scientific paper, it was a mistake and it shouldn't have been used.

I seldom advocate for deprecating :).

G

From pedro.cardoso.code at gmail.com  Sun Mar 29 13:21:21 2020
From: pedro.cardoso.code at gmail.com (Pedro Cardoso)
Date: Sun, 29 Mar 2020 18:21:21 +0100
Subject: [scikit-learn] [GridSearchCV] Reduction of elapsed time at the second iteration
Message-ID: 

Hello fellows, I am new to sklearn and I have a question about GridSearchCV.

I am running the following code in a Jupyter notebook:

----------------------*code*-------------------------------
opt_models = dict()
for feature in [features1, features2, features3, features4]:
    cmb = CMB(x_train, y_train, x_test, y_test, feature)
    cmb.fit()
    cmb.predict()
    opt_models[str(feature)] = cmb.get_best_model()
-------------------------------------------------------

The CMB class is just a class that contains different classification models (SVC, decision tree, etc.). When cmb.fit() is running, a GridSearchCV is performed on the SVC model (which is within the cmb instance) in order to tune the hyperparameters C, gamma, and kernel. The SVC model is implemented using the sklearn.svm.SVC class.

Here is the output of the first and second iteration of the for loop:

---------------------*output*-------------------------------------
-> 1st iteration
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 6.1s
[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 6.1s
[Parallel(n_jobs=-1)]: Done 3 tasks | elapsed: 6.1s
[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 6 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 7 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 11 tasks | elapsed: 6.2s
[Parallel(n_jobs=-1)]: Done 12 tasks | elapsed: 6.3s
[Parallel(n_jobs=-1)]: Done 13 tasks | elapsed: 6.3s
[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 6.3s
[Parallel(n_jobs=-1)]: Done 15 tasks | elapsed: 6.4s
[Parallel(n_jobs=-1)]: Done 16 tasks | elapsed: 6.4s
[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 6.4s
[Parallel(n_jobs=-1)]: Done 18 tasks | elapsed: 6.4s
[Parallel(n_jobs=-1)]: Done 19 tasks | elapsed: 6.5s
[Parallel(n_jobs=-1)]: Done 20 tasks | elapsed: 6.5s
[Parallel(n_jobs=-1)]: Done 21 tasks | elapsed: 6.5s
[Parallel(n_jobs=-1)]: Done 22 tasks | elapsed: 6.6s
[Parallel(n_jobs=-1)]: Done 23 tasks | elapsed: 6.7s
[Parallel(n_jobs=-1)]: Done 24 tasks | elapsed: 6.7s
[Parallel(n_jobs=-1)]: Done 25 tasks | elapsed: 6.7s
[Parallel(n_jobs=-1)]: Done 26 tasks | elapsed: 6.8s
[Parallel(n_jobs=-1)]: Done 27 tasks | elapsed: 6.8s
[Parallel(n_jobs=-1)]: Done 28 tasks | elapsed: 6.9s
[Parallel(n_jobs=-1)]: Done 29 tasks | elapsed: 6.9s
[Parallel(n_jobs=-1)]: Done 30 tasks | elapsed: 6.9s
[Parallel(n_jobs=-1)]: Done 31 tasks | elapsed: 7.0s
[Parallel(n_jobs=-1)]: Done 32 tasks | elapsed: 7.0s
[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 7.0s
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 7.0s
[Parallel(n_jobs=-1)]: Done 35 tasks | elapsed: 7.1s
[Parallel(n_jobs=-1)]: Done 36 tasks | elapsed: 7.1s
[Parallel(n_jobs=-1)]: Done 37 tasks | elapsed: 7.2s
[Parallel(n_jobs=-1)]: Done 38 tasks | elapsed: 7.2s
[Parallel(n_jobs=-1)]: Done 39 tasks | elapsed: 7.2s
[Parallel(n_jobs=-1)]: Done 40 tasks | elapsed: 7.2s
[Parallel(n_jobs=-1)]: Done 41 tasks | elapsed: 7.3s
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 7.3s
[Parallel(n_jobs=-1)]: Done 43 tasks | elapsed: 7.3s
[Parallel(n_jobs=-1)]: Done 44 tasks | elapsed: 7.4s
[Parallel(n_jobs=-1)]: Done 45 tasks | elapsed: 7.4s
[Parallel(n_jobs=-1)]: Done 46 tasks | elapsed: 7.5s

-> 2nd iteration
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0260s.) Setting batch_size=14.
[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Done 3 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Done 60 out of 60 | elapsed: 0.7s finished
---------------------------------------------------------------------------------------------------------------------

As you can see, the first iteration has a much larger elapsed time than the 2nd iteration. Does that make sense? I am afraid that the model is doing some kind of caching or taking a shortcut from the 1st iteration, which could consequently decrease the model training/performance. I already read the sklearn documentation and I didn't see any warning/note about this kind of behaviour.
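Independently of GridSearchCV itself, part of that pattern is easy to reproduce with joblib alone: with the loky backend, the worker pool is started on first use and then reused by later calls in the same process. A minimal sketch (the toy function and sizes below are made up for illustration):

import time
from joblib import Parallel, delayed

def slow_square(x):
    time.sleep(0.1)  # pretend this is a model fit
    return x * x

for it in range(2):
    tic = time.time()
    Parallel(n_jobs=-1)(delayed(slow_square)(i) for i in range(20))
    print("call", it + 1, "took %.2fs" % (time.time() - tic))

The first call pays the cost of spawning the worker processes; the second call reuses the already-running workers, so it finishes faster even though the work submitted is identical.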
Thank you very much for your time :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From MC_George123 at hotmail.com  Mon Mar 30 03:33:08 2020
From: MC_George123 at hotmail.com (=?utf-8?B?5qiKIOS5puWNjg==?=)
Date: Mon, 30 Mar 2020 07:33:08 +0000
Subject: [scikit-learn] A basic question about kmeans algorithms elkan and llyod
In-Reply-To: 
References: 
Message-ID: 

Hi,

Thanks for your suggestion of the paper. However, the paper shows many more algorithms and finds that different algorithms show different performance on datasets with various dimensions, the Lloyd algorithm not included. What I want to know is whether we can remove the Lloyd algorithm from kmeans in scikit-learn, since elkan is an optimized one with better performance.

Best regards,
George

From: scikit-learn On Behalf Of Andreas Mueller
Sent: Saturday, March 28, 2020 12:37 AM
To: scikit-learn at python.org
Subject: Re: [scikit-learn] A basic question about kmeans algorithms elkan and llyod

There's an interesting analysis in this paper:
Fast K-Means with Accurate Bounds
http://proceedings.mlr.press/v48/newling16.pdf

On 3/26/20 3:40 AM, Alexandre Gramfort wrote:

hi,

I suspect Elkan is really winning when you have many centroids
so the conclusion is not systematic

my 2c
Alex

On Thu, Mar 26, 2020 at 3:18 AM MC_George123 at hotmail.com <MC_George123 at hotmail.com> wrote:

Hi admins,

My team is working on optimization on scikit-learn staff now. When it comes to kmeans, I find there are two algorithms, one of which is lloyd and the other is elkan, which is the optimized one for lloyd using triangle inequality. In the older version of scikit-learn, elkan only supports dense dataset instead of sparse one. And in the latest version, elkan supports both type of datasets. So there is a question why both two algorithms are kept in kmeans since they do the almost same thing and elkan is a optimized one for lloyd. Are there any precision difference between two algorithms and how can I decide what algorithm to use?

Best regards,

George Fan

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From adrin.jalali at gmail.com  Mon Mar 30 07:02:14 2020
From: adrin.jalali at gmail.com (Adrin)
Date: Mon, 30 Mar 2020 13:02:14 +0200
Subject: [scikit-learn] Monthly meetings
In-Reply-To: <080445a5-1230-26c2-b582-03c760d1f80e@gmail.com>
References: <080445a5-1230-26c2-b582-03c760d1f80e@gmail.com>
Message-ID: 

Hi,

The new meeting ID:
https://anaconda.zoom.us/j/324780759?pwd=a1ROSFE2Nnc0cHBaeUtiVS93QnpHQT09

Meeting ID: 324 780 759
Password: 617892

On Thu, Mar 26, 2020 at 9:00 PM Nicolas Hug wrote:

> Hi all,
>
> The next scikit-learn monthly meeting will take place on Monday (
> https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=3&day=30&hour=11&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195
> )
>
> While these meetings are mainly for core-devs to discuss the current
> topics, we're also happy to welcome non-core devs and other projects
> maintainers! Feel free to join.
>
> *Location:*
>
> Join Zoom Meeting
>
> https://anaconda.zoom.us/j/947129165?pwd=dEFZNHM0ZFBiQWlDYlJlRW1EaHg2QT09
>
> Meeting ID: 947 129 165
> Password: 586745
>
>
> Thanks,
> Nicolas
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From olivier.grisel at ensta.org  Mon Mar 30 07:03:44 2020
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Mon, 30 Mar 2020 13:03:44 +0200
Subject: [scikit-learn] Monthly meetings
In-Reply-To: <080445a5-1230-26c2-b582-03c760d1f80e@gmail.com>
References: <080445a5-1230-26c2-b582-03c760d1f80e@gmail.com>
Message-ID: 

I get a message for an invalid meeting id.

-- 
Olivier

From t3kcit at gmail.com  Mon Mar 30 10:30:09 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 30 Mar 2020 10:30:09 -0400
Subject: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team
In-Reply-To: <20200327222017.fv7jgxrbulntgmbm@phare.normalesup.org>
References: <60bf6211-18f9-7408-03da-a5157c754145@gmail.com> <20200327222017.fv7jgxrbulntgmbm@phare.normalesup.org>
Message-ID: <272061a7-0eda-dd2c-8666-a4be22a40e92@gmail.com>

On 3/27/20 6:20 PM, Gael Varoquaux wrote:
> Thanks for the link Andy. This is indeed very interesting!
>
> On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
>>> Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression,
>>> MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this
>>> order).
>> Maybe LinearRegression docstring should more strongly suggest to use Ridge
>> with small regularization in practice.
> Yes! I actually wonder if we should not remove LinearRegression. It's a
> bit frightening me that so many people use it. The only time that I've
> seen it used in a scientific people, it was a mistake and it shouldn't
> have been used.
>
> I seldom advocate for deprecating :).
>

People use sklearn for inference. I'm not sure we should deprecate this use case even though it's not our primary motivation.

Also, there's an inconsistency here: Logistic Regression has an L2 penalty by default (to the annoyance of some), while Linear Regression does not. We have discussed the meaning of the different classes for linear models several times; they are certainly not consistent (ridge, lasso and lr are three classes for squared loss, while all three are in LogisticRegression for the log loss).

I think to many, "use statsmodels" is not a satisfying answer.

I have seen people argue that linear regression or logistic regression should throw an error on collinear data, and I think that's not in the spirit of sklearn (even though we had this as a warning in discriminant analysis until recently). But we should probably have clearer signaling about this.

Our documentation doesn't really emphasize the prediction vs inference point enough, I think.

Btw, we could also make our linear regression more stable by using the minimum norm solution via the SVD.
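For that last point, a minimal sketch of what a "minimum norm solution via the SVD" looks like in practice, using plain numpy rather than any scikit-learn API (the small random design matrix is made up for illustration):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(20, 5)
X[:, 4] = X[:, 2] + X[:, 3]   # make the design rank-deficient (collinear columns)
y = rng.randn(20)

# np.linalg.lstsq solves the least-squares problem through an SVD and, for
# rank-deficient X, returns the solution with the smallest norm, so it stays
# well defined even though X.T @ X is singular here.
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("rank of X:", rank)
print("coefficients:", coef)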
From t3kcit at gmail.com  Mon Mar 30 10:35:43 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 30 Mar 2020 10:35:43 -0400
Subject: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team
In-Reply-To: <272061a7-0eda-dd2c-8666-a4be22a40e92@gmail.com>
References: <60bf6211-18f9-7408-03da-a5157c754145@gmail.com> <20200327222017.fv7jgxrbulntgmbm@phare.normalesup.org> <272061a7-0eda-dd2c-8666-a4be22a40e92@gmail.com>
Message-ID: <71befd21-75b6-6370-2416-a5b01225c492@gmail.com>

Also see https://github.com/scikit-learn/scikit-learn/issues/14268 which is discussing how to make things faster *and* more stable!

On 3/30/20 10:30 AM, Andreas Mueller wrote:
>
> On 3/27/20 6:20 PM, Gael Varoquaux wrote:
>> Thanks for the link Andy. This is indeed very interesting!
>>
>> On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
>>>> Regarding learners, Top-5 in both GH17 and GH19 are
>>>> LogisticRegression,
>>>> MultinomialNB, SVC, LinearRegression, and RandomForestClassifier
>>>> (in this
>>>> order).
>>> Maybe LinearRegression docstring should more strongly suggest to use
>>> Ridge
>>> with small regularization in practice.
>> Yes! I actually wonder if we should not remove LinearRegression. It's a
>> bit frightening me that so many people use it. The only time that I've
>> seen it used in a scientific people, it was a mistake and it shouldn't
>> have been used.
>>
>> I seldom advocate for deprecating :).
>>
> People use sklearn for inference. I'm not sure we should deprecate
> this usecase even though it's not
> our primary motivation.
>
> Also, there's an inconsistency here: Logistic Regression has an L2
> penalty by default (to the annoyance of some),
> while Linear Regression does not. We have discussed the meaning of the
> different classes for linear models several times,
> they are certainly not consistent (ridge, lasso and lr are three
> classes for squared loss while all three are in LogisticRegression for
> the log loss).
>
> I think to many "use statsmodels" is not a satisfying answer.
>
> I have seen people argue that linear regression or logistic regression
> should throw an error on colinear data, and I think that's not in the
> spirit of sklearn
> (even though we had this as a warning in discriminant analysis until
> recently).
> But we should probably have more clear signaling about this.
>
> Our documentation doesn't really emphasize the prediction vs inference
> point enough, I think.
>
> Btw, we could also make our linear regression more stable by using the
> minimum norm solution via the SVD.

From t3kcit at gmail.com  Mon Mar 30 15:03:58 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 30 Mar 2020 15:03:58 -0400
Subject: [scikit-learn] A basic question about kmeans algorithms elkan and llyod
In-Reply-To: 
References: 
Message-ID: <1982d76e-554e-0770-3eb1-e970f3a9e983@gmail.com>

Sorry, I thought it also did experiments on what they call "sta" but I guess they are not included. The conclusion is the same, though. Different algorithms show different performance on different datasets.

The Yinyang k-means paper has some elkan vs lloyd figures:
http://proceedings.mlr.press/v37/ding15.pdf
In table 2, the Elkan row, in cases where the speedup is <1, it means elkan is slower than lloyd. Elkan is also more memory intensive, so you can see some missing values where the computation couldn't be performed, but lloyd could.

On 3/30/20 3:33 AM, 樊 书华 wrote:
> Hi,
> Thanks for your suggestion of the paper.
> However, the paper shows many more algorithms and finds out different
> algorithms show different performance on dataset with various
> dimensions, Lloyd algorithm not included. What I want to know is that
> can we remove the Lloyd algorithm in kmeans of scikit-learn since
> elkan is an optimized on with better performance.
>
> Best regards,
>
> George
>
> *From:* scikit-learn *On Behalf Of *Andreas Mueller
> *Sent:* Saturday, March 28, 2020 12:37 AM
> *To:* scikit-learn at python.org
> *Subject:* Re: [scikit-learn] A basic question about kmeans algorithms
> elkan and llyod
>
> There's an interesting analysis in this paper:
> Fast K-Means with Accurate Bounds
>
> http://proceedings.mlr.press/v48/newling16.pdf
>
> On 3/26/20 3:40 AM, Alexandre Gramfort wrote:
>
> hi,
>
> I suspect Elkan is really winning when you have many centroids
>
> so the conclusion is not systematic
>
> my 2c
>
> Alex
>
> On Thu, Mar 26, 2020 at 3:18 AM MC_George123 at hotmail.com
> <MC_George123 at hotmail.com> wrote:
>
> Hi admins,
>
> My team is working on optimization on scikit-learn staff now.
> When it comes to kmeans, I find there are two algorithms, one
> of which is lloyd and the other is elkan, which is the
> optimized one for lloyd using triangle inequality. In the
> older version of scikit-learn, elkan only supports dense
> dataset instead of sparse one. And in the latest version,
> elkan supports both type of datasets. So there is a question
> why both two algorithms are kept in kmeans since they do the
> almost same thing and elkan is a optimized one for lloyd. Are
> there any precision difference between two algorithms and how
> can I decide what algorithm to use?
>
> Best regards,
>
> George Fan
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From MC_George123 at hotmail.com  Tue Mar 31 03:49:45 2020
From: MC_George123 at hotmail.com (=?utf-8?B?5qiKIOS5puWNjg==?=)
Date: Tue, 31 Mar 2020 07:49:45 +0000
Subject: [scikit-learn] A basic question about kmeans algorithms elkan and llyod
In-Reply-To: <1982d76e-554e-0770-3eb1-e970f3a9e983@gmail.com>
References: <1982d76e-554e-0770-3eb1-e970f3a9e983@gmail.com>
Message-ID: 

Thank you very much for your information.

From: scikit-learn On Behalf Of Andreas Mueller
Sent: Tuesday, March 31, 2020 3:04 AM
To: scikit-learn at python.org
Subject: Re: [scikit-learn] A basic question about kmeans algorithms elkan and llyod

sorry I thought it also did experiements on what they call "sta" but I guess they are not included. The conclusion is the same, though. Different algorithms show different performance on different datasets.

The Yingyang k-means has some elkan vs lloyd figures:
http://proceedings.mlr.press/v37/ding15.pdf
In table 2, the Elkan row, in cases the speedup is <1, it means elkans is slower than lloyd. Elkans is also more memory intensive, so you can see some missing values in that where the computation couldn't be performed, but lloyd could.

On 3/30/20 3:33 AM, 樊 书华 wrote:
Hi,
Thanks for your suggestion of the paper.
However, the paper shows many more algorithms and finds out different algorithms show different performance on dataset with various dimensions, Lloyd algorithm not included. What I want to know is that can we remove the Lloyd algorithm in kmeans of scikit-learn since elkan is an optimized on with better performance. Best regards, George From: scikit-learn On Behalf Of Andreas Mueller Sent: Saturday, March 28, 2020 12:37 AM To: scikit-learn at python.org Subject: Re: [scikit-learn] A basic question about kmeans algorithms elkan and llyod There's an interesting analysis in this paper: Fast K-Means with Accurate Bounds http://proceedings.mlr.press/v48/newling16.pdf On 3/26/20 3:40 AM, Alexandre Gramfort wrote: hi, I suspect Elkan is really winning when you have many centroids so the conclusion is not systematic my 2c Alex On Thu, Mar 26, 2020 at 3:18 AM MC_George123 at hotmail.com > wrote: Hi admins, My team is working on optimization on scikit-learn staff now. When it comes to kmeans, I find there are two algorithms, one of which is lloyd and the other is elkan, which is the optimized one for lloyd using triangle inequality. In the older version of scikit-learn, elkan only supports dense dataset instead of sparse one. And in the latest version, elkan supports both type of datasets. So there is a question why both two algorithms are kept in kmeans since they do the almost same thing and elkan is a optimized one for lloyd. Are there any precision difference between two algorithms and how can I decide what algorithm to use? Best regards, George Fan _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From benoit.presles at u-bourgogne.fr Tue Mar 31 09:48:50 2020 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Tue, 31 Mar 2020 15:48:50 +0200 Subject: [scikit-learn] Number of informative features vs total number of features Message-ID: <10c2473f-50e3-c959-b9f7-07c2b903c840@u-bourgogne.fr> Dear sklearn users, I did some supervised classification simulations with the make_classification function from sklearn increasing the number of informative features from 1 out of 40 to 40 out of 40 (100%). I did not generate any repeated or redundant features. I fixed the number of classes to two and the number of clusters per class to one. I split the dataset 100 times using the StratifiedShuffleSplit function into two subsets: a training set and a test set (80% - 20%). I performed a logistic regression and calculated training and testing accuracies and averaged the results over the 100 splits leading to a mean training accuracy and a mean testing accuracy. I was expecting to get an increasing accuracy score as a function of informative features for both the training and the test sets. On the contrary, I have got the best training and test scores for one informative feature. Why do I get these results ? 
Thanks for your help,
Best regards,
Ben

Below the simulation code I have written:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

RANDOM_SEED = 4

n_inf = np.array([1, 5, 10, 15, 20, 25, 30, 35, 40])
mean_training_score_array = np.array([])
mean_testing_score_array = np.array([])
for n_inf_value in n_inf:
    X, y = make_classification(n_samples=2500,
                               n_features=40,
                               n_informative=n_inf_value,
                               n_redundant=0,
                               n_repeated=0,
                               n_classes=2,
                               n_clusters_per_class=1,
                               random_state=RANDOM_SEED,
                               shuffle=False)
    #
    print('Simulated data - number of informative features = ' + str(n_inf_value))
    #
    sss = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=RANDOM_SEED)
    training_score_array = np.array([])
    testing_score_array = np.array([])
    for train_index_split, test_index_split in sss.split(X, y):
        X_split_train, X_split_test = X[train_index_split], X[test_index_split]
        y_split_train, y_split_test = y[train_index_split], y[test_index_split]
        scaler = StandardScaler()
        X_split_train = scaler.fit_transform(X_split_train)
        X_split_test = scaler.transform(X_split_test)
        lr = LogisticRegression(fit_intercept=True, max_iter=1e9, verbose=0,
                                random_state=RANDOM_SEED, solver='lbfgs', tol=1e-6, C=10)
        lr.fit(X_split_train, y_split_train)
        y_pred_train = lr.predict(X_split_train)
        y_pred_test = lr.predict(X_split_test)
        accuracy_train_score = accuracy_score(y_split_train, y_pred_train)
        accuracy_test_score = accuracy_score(y_split_test, y_pred_test)
        training_score_array = np.append(training_score_array, accuracy_train_score)
        testing_score_array = np.append(testing_score_array, accuracy_test_score)
    mean_training_score_array = np.append(mean_training_score_array, np.average(training_score_array))
    mean_testing_score_array = np.append(mean_testing_score_array, np.average(testing_score_array))
#
print('mean_training_score_array=' + str(mean_training_score_array))
print('mean_testing_score_array=' + str(mean_testing_score_array))
#
plt.plot(n_inf, mean_training_score_array, 'r', label='mean training score')
plt.plot(n_inf, mean_testing_score_array, 'g', label='mean testing score')
plt.xlabel('number of informative features out of 40')
plt.ylabel('accuracy')
plt.legend()
plt.show()
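For reference, the split/scale/fit loop above can also be written more compactly with a Pipeline and cross_validate. This is only a sketch that restructures the same experiment (same generator settings, splitter and LogisticRegression parameters as above), not a change to it:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_SEED = 4

for n_inf_value in [1, 5, 10, 15, 20, 25, 30, 35, 40]:
    X, y = make_classification(n_samples=2500, n_features=40,
                               n_informative=n_inf_value, n_redundant=0,
                               n_repeated=0, n_classes=2,
                               n_clusters_per_class=1,
                               random_state=RANDOM_SEED, shuffle=False)
    # The scaler is fitted on each training split only, as in the manual loop.
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(C=10, tol=1e-6, max_iter=int(1e9),
                                           solver='lbfgs',
                                           random_state=RANDOM_SEED))
    cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2,
                                random_state=RANDOM_SEED)
    scores = cross_validate(clf, X, y, cv=cv, return_train_score=True)
    print(n_inf_value,
          scores['train_score'].mean(),
          scores['test_score'].mean())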