[scikit-learn] Differences between scikit-learn and Spark.ml for regression toy problem

Joel Nothman joel.nothman at gmail.com
Wed Mar 15 23:57:35 EDT 2017


sklearn's (and hence liblinear's) intercept is not being used here; instead, a
constant feature is appended in the Python code to represent the bias, so the
bias is being regularised in any case.
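
A minimal sketch of the alternative (hedged: it relies on sklearn's lbfgs
solver, which, unlike liblinear, leaves the intercept out of the penalty
term):

    from sklearn.linear_model import LogisticRegression

    # Let the estimator fit its own, unpenalised intercept instead of
    # appending a constant column that the L2 term then shrinks.
    e = LogisticRegression(fit_intercept=True, penalty='l2', C=1/0.3,
                           solver='lbfgs', tol=1e-11)
    # e.fit(X, y)  # the original X, without the appended ones column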

On 16 March 2017 at 14:27, Sebastian Raschka <se.raschka at gmail.com> wrote:

> I think the liblinear solver (the default in LogisticRegression) does
> regularize the bias. So, even if both solvers (sklearn's and Spark's)
> converge to the global optimum of their cost functions, the model
> parameters would be different. Maybe a better way to make that comparison
> would be to turn off regularization completely for now. And when you run
> LogisticRegression, maybe run it multiple times with different random
> seeds to see if your solutions are generally stable; a sketch is below.
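>
> A minimal sketch of that check (hedged: sklearn has no exact "off" switch
> for the penalty, so a very large C approximates no regularization; X and
> y here are the arrays from the code further down):
>
>     from sklearn.linear_model import LogisticRegression
>
>     for seed in range(5):
>         clf = LogisticRegression(C=1e12, tol=1e-11, random_state=seed)
>         clf.fit(X, y)
>         print(seed, clf.coef_, clf.intercept_)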
>
> Best,
> Sebastian
>
> > On Mar 13, 2017, at 1:06 PM, Stuart Reynolds <stuart at stuartreynolds.net> wrote:
> >
> > Both libraries are heavily parameterized. You should check what the
> > defaults are for both.
> >
> > Some ideas:
> >  - What regularization is being used: L1 or L2?
> >  - Does the regularization parameter have the same interpretation? Is
> > 1/C = lambda? Some libraries use C; some use lambda. (See the sketch
> > after this list.)
> >  - Also, some libraries regularize the intercept (scikit-learn does);
> > others do not. (Regularizing the intercept doesn't seem like a
> > particularly good idea if your optimizer permits not doing it.)
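> >
> > A rough sketch of that mapping (hedged: this assumes Spark's L2
> > objective is the mean loss plus (regParam/2)*||w||^2, that sklearn's
> > LogisticRegression penalizes the summed loss with (1/(2C))*||w||^2, and
> > that sklearn's Ridge objective is ||y - Xw||^2 + alpha*||w||^2):
> >
> >     n = 10                      # number of training rows
> >     reg_param = 0.3             # Spark's regParam (lambda)
> >     C = 1.0 / (n * reg_param)   # equivalent sklearn LogisticRegression C
> >     alpha = n * reg_param       # equivalent sklearn Ridge alpha
> >
> > If those objectives are right, matching Spark's regParam=0.3 on ten rows
> > needs C = 1/3 in sklearn, not 1/0.3 as in the code below.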
> >
> >
> >
> > On Sun, Mar 12, 2017 at 7:07 PM, Frank Astier via scikit-learn <scikit-learn at python.org> wrote:
> > (this was also posted to stackoverflow on 03/10)
> >
> > I am setting up a very simple logistic regression problem in scikit-learn
> > and in spark.ml, and the results diverge: the models they learn are
> > different, but I can't figure out why (data is the same, model type is
> > the same, regularization is the same...).
> >
> > No doubt I am missing some setting on one side or the other. Which
> > setting? How should I set up either scikit or spark.ml to find the same
> > model as its counterpart?
> >
> > I give the sklearn code and spark.ml code below. Both should be ready to
> > cut-and-paste and run.
> >
> > scikit-learn code:
> > ----------------------
> >
> >     import numpy as np
> >     from sklearn.linear_model import LogisticRegression, Ridge
> >
> >     X = np.array([
> >         [-0.7306653538519616, 0.0],
> >         [0.6750417712898752, -0.4232874171873786],
> >         [0.1863463229359709, -0.8163423997075965],
> >         [-0.6719842051493347, 0.0],
> >         [0.9699938346531928, 0.0],
> >         [0.22759406190283604, 0.0],
> >         [0.9688721028330911, 0.0],
> >         [0.5993795346650845, 0.0],
> >         [0.9219423508390701, -0.8972778242305388],
> >         [0.7006904841584055, -0.5607635619919824]
> >     ])
> >
> >     y = np.array([
> >         0.0,
> >         1.0,
> >         1.0,
> >         0.0,
> >         1.0,
> >         1.0,
> >         1.0,
> >         0.0,
> >         0.0,
> >         0.0
> >     ])
> >
> >     m, n = X.shape
> >
> >     # Add intercept term to simulate inputs to GameEstimator
> >     X_with_intercept = np.hstack((X, np.ones(m)[:,np.newaxis]))
> >
> >     l = 0.3
> >     e = LogisticRegression(
> >         fit_intercept=False,
> >         penalty='l2',
> >         C=1/l,
> >         max_iter=100,
> >         tol=1e-11)
> >
> >     e.fit(X_with_intercept, y)
> >
> >     print(e.coef_)
> >     # => [[ 0.98662189  0.45571052 -0.23467255]]
> >
> >     # L2-regularized linear regression is called Ridge in sklearn
> >     e = Ridge(
> >         fit_intercept=False,
> >         alpha=l,
> >         max_iter=100,
> >         tol=1e-11)
> >
> >     e.fit(X_with_intercept, y)
> >
> >     print(e.coef_)
> >     # =>[ 0.32155545  0.17904355  0.41222418]
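> >
> >     # Cross-check (hedged: assumes sklearn Ridge's documented objective
> >     # ||y - Xw||^2 + alpha * ||w||^2): with fit_intercept=False the
> >     # closed form is w = (X'X + alpha*I)^{-1} X'y, so the coefficients
> >     # above can be verified directly.
> >     w = np.linalg.solve(
> >         X_with_intercept.T.dot(X_with_intercept) + l * np.eye(n + 1),
> >         X_with_intercept.T.dot(y))
> >     print(w)  # should agree with e.coef_ up to solver tolerance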
> >
> > spark.ml code:
> > -------------------
> >
> >     import org.apache.spark.{SparkConf, SparkContext}
> >     import org.apache.spark.ml.classification.LogisticRegression
> >     import org.apache.spark.ml.regression.LinearRegression
> >     import org.apache.spark.mllib.linalg.Vectors
> >     import org.apache.spark.mllib.regression.LabeledPoint
> >     import org.apache.spark.sql.SQLContext
> >
> >     object TestSparkRegression {
> >       def main(args: Array[String]): Unit = {
> >         import org.apache.log4j.{Level, Logger}
> >
> >         Logger.getLogger("org").setLevel(Level.OFF)
> >         Logger.getLogger("akka").setLevel(Level.OFF)
> >
> >         val conf = new SparkConf().setAppName("test").setMaster("local")
> >         val sc = new SparkContext(conf)
> >
> >         val sparkTrainingData = new SQLContext(sc)
> >           .createDataFrame(Seq(
> >             LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)),
> >             LabeledPoint(1.0, Vectors.dense(0.6750417712898752, -0.4232874171873786)),
> >             LabeledPoint(1.0, Vectors.dense(0.1863463229359709, -0.8163423997075965)),
> >             LabeledPoint(0.0, Vectors.dense(-0.6719842051493347, 0.0)),
> >             LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)),
> >             LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)),
> >             LabeledPoint(1.0, Vectors.dense(0.9688721028330911, 0.0)),
> >             LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)),
> >             LabeledPoint(0.0, Vectors.dense(0.9219423508390701, -0.8972778242305388)),
> >             LabeledPoint(0.0, Vectors.dense(0.7006904841584055, -0.5607635619919824))))
> >           .toDF("label", "features")
> >
> >         val logisticModel = new LogisticRegression()
> >           .setRegParam(0.3)
> >           .setLabelCol("label")
> >           .setFeaturesCol("features")
> >           .fit(sparkTrainingData)
> >
> >         println(s"Spark logistic model coefficients:
> ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}")
> >         // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713]
> Intercept: -0.13897955358689987
> >
> >         val linearModel = new LinearRegression()
> >           .setRegParam(0.3)
> >           .setLabelCol("label")
> >           .setFeaturesCol("features")
> >           .setSolver("l-bfgs")
> >           .fit(sparkTrainingData)
> >
> >         println(s"Spark linear model coefficients:
> ${linearModel.coefficients} Intercept: ${linearModel.intercept}")
> >         // Spark linear model coefficients: [0.19852664861346023,0.11501200541407802]
> Intercept: 0.45464906876832323
> >
> >         sc.stop()
> >       }
> >     }
> >
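> > A hedged aside on reconciling the two: spark.ml standardizes features
> > before fitting by default (standardization=true), which the sklearn code
> > does not do, so that default may need turning off for a like-for-like
> > comparison. A pyspark sketch (parameter names are pyspark's; it assumes
> > an active SparkSession, and regParam stays 0.3 as above):
> >
> >     from pyspark.ml.classification import LogisticRegression
> >
> >     # standardization=False removes a preprocessing step that the
> >     # sklearn code does not perform.
> >     lr = LogisticRegression(regParam=0.3, standardization=False,
> >                             labelCol="label", featuresCol="features")
> >     # model = lr.fit(sparkTrainingData)  # same DataFrame as above
> >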
> > Thanks,
> >
> > Frank
> >