[omaha] Group Data Science Competition

Wed Dec 28 08:22:39 EST 2016

Nice job, Jeremy!  We're in the triple digits!!

On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha <omaha at python.org>
wrote:

> Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our
> score! Currently sitting at 798th place.
>
> My notebook is on GitHub for those interested:
>
> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4
>
> Jeremy
>
>
> > On Dec 25, 2016, at 9:41 PM, Wes Turner via Omaha <omaha at python.org>
> > wrote:
> >
> >> On Sun, Dec 25, 2016 at 7:40 PM, Wes Turner <wes.turner at gmail.com>
> >> wrote:
> >>
> >>
> >>
> >>> On Sunday, December 25, 2016, Bob Haffner <bob.haffner at gmail.com>
> >>> wrote:
> >>>
> >>> Merry Christmas, everyone!
> >>>
> >>>
> >> Merry Christmas!
> >>
> >>>
> >>> Still heading down the TPOT path with limited success.  I get varying
> >>> scores (tpot.score()) with the same result (kaggle scoring)
> >>>
> >>> Any other TPOT users getting inconsistent results?   Specifically with
> >>> 0.6.7?
> >>>
> >>
> >> There may be variance because of the way TPOT splits X_train into X_train
> >> and X_test w/ train_size and test_size.
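A minimal sketch of that effect, assuming it really is the random hold-out split that moves the score. This is not TPOT's internal code: it uses scikit-learn's train_test_split and a plain Ridge model on synthetic data (both are stand-ins chosen for illustration) to show that the same model scores differently on different splits, and identically once the seed is pinned.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only, not the Kaggle set).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=2.0, size=200)


def holdout_score(seed):
    """Fit on a random 75% of the rows and return R^2 on the remaining 25%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.75, random_state=seed
    )
    return Ridge(alpha=1.0).fit(X_tr, y_tr).score(X_te, y_te)


# Same model, same data, different splits: the hold-out score moves around.
scores = [holdout_score(seed) for seed in range(5)]
print(scores)

# Pinning the seed makes the split, and therefore the score, repeatable.
print(holdout_score(0) == holdout_score(0))
```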
> >>
> >> I rewrote load_house_prices as a class w/ a better mccabe cyclomatic
> >> complexity score with a concatenation step so that X_train and X_test have
> >> the same columns (in data.py)
> >>
> >> It probably makes sense to use scikit-learn for data transformation (e.g.
> >> OneHotEncoder instead of get_dummies).
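A rough sketch of the OneHotEncoder route, assuming a recent scikit-learn release in which OneHotEncoder accepts string categories directly. The tiny frames and the MSZoning values are made up for illustration. The encoder learns its columns from the training data only, and handle_unknown="ignore" encodes a level seen only in the test data as all zeros, so both matrices end up with the same width.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: the test set contains a level ("C (all)") never seen in training.
train = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV"]})
test = pd.DataFrame({"MSZoning": ["RL", "C (all)", "RM"]})

# Learn the category -> column mapping from the training data only.
encoder = OneHotEncoder(handle_unknown="ignore")
train_X = encoder.fit_transform(train[["MSZoning"]]).toarray()

# Reuse the fitted mapping on the test data; unknown levels become all zeros.
test_X = encoder.transform(test[["MSZoning"]]).toarray()

print(encoder.categories_)
print(train_X.shape, test_X.shape)
```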
> >>
> >> https://twitter.com/westurner/status/813011289475842048 :
> >> """
> >> . at scikit_learn
> >> Src: https://t.co/biMt6XRt2T
> >> Docs: https://t.co/Lb5EYRCdI8
> >> #API:
> >> .fit_transform(X, y)
> >> .fit(X_train, y_train)
> >> .predict(X_test)
> >> """
> >>
> >> I haven't yet run w/ pd.get_dummies and df.to_sparse; that may or may not
> >> prevent the oom error.
> >>
> >> Looking at https://libraries.io/pypi/xgboost "Dependent Repositories",
> >> there are a number of scikit-learn-compatible packages for automating
> >> analysis in addition to TPOT: auto-sklearn, rep.
> >> auto_ml mentions 12 algos for type_of_estimator='regressor'.
> >> (and sparse matrices, and other parameters).
> >>
> >> https://github.com/ClimbsRocks/auto_ml
> >>
> >> http://auto-ml.readthedocs.io/en/latest/
> >>
> >
> > Here's a (probably overfitted) auto_ml attempt:
> > https://github.com/westurner/house_prices/blob/7260ada0c10cf371b33973b0d9af6bca860d0008/house_prices/analysis_auto_ml.py
> >
> > https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3958857
> > ..."Your submission scored 9.45422, which is not an improvement of your
> > best score. "
> >
> > Setting .train(compute_power=10) errored out after a bunch of GridSearchCV.
> >
> >
> >>
> >>
> >> I should be able to generate column_descriptions from parse_description in
> >> data.py:
> >> https://github.com/westurner/house_prices/blob/develop/house_prices/data.py
> >>
> >> https://github.com/automl/auto-sklearn looks cool too.
> >>
> >> ... http://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
> >>
> >> http://tflearn.org
> >>
> >>
> >>>
> >>>
> >>> On Fri, Dec 23, 2016 at 8:03 AM, Luke Schollmeyer <
> >>> luke.schollmeyer at gmail.com> wrote:
> >>>
> >>>> Moved the needle a little bit yesterday with a ridge regression attempt
> >>>> using the same feature engineering I described before.
> >>>>
> >>>> Luke
> >>>>
> >>>> On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner <bob.haffner at gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Made a TPOT attempt tonight.  Could only do some numeric features
> >>>>> though because including categoricals would cause my ipython kernel
> >>>>> to die.
> >>>>>
> >>>>> I will try a bigger box this weekend
> >>>>>
> >>>>> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha <omaha at python.org>
> >>>>> wrote:
> >>>>>
> >>>>>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner <wes.turner at gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner <wes.turner at gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer <
> >>>>>>>> luke.schollmeyer at gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> The quick explanation is that rather than dropping outliers, I used
> >>>>>>>>> numpy's log1p function to help normalize the distribution of the data
> >>>>>>>>> (for both the sale price and for all features over a certain
> >>>>>>>>> skewness). I was also struggling with adding in more features to the
> >>>>>>>>> model.
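A minimal sketch of the log1p step described above, on synthetic data. The 0.75 skewness cutoff is an assumption picked purely for illustration (the thread does not say which threshold was used), and the column values are generated rather than taken from train.csv.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (illustrative values only).
rng = np.random.RandomState(0)
train = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=0.5, size=500),    # right-skewed
    "OverallQual": rng.normal(loc=6.0, scale=1.0, size=500),    # roughly symmetric
    "SalePrice": rng.lognormal(mean=12.0, sigma=0.4, size=500),
})

SKEW_THRESHOLD = 0.75  # assumed cutoff, not taken from the thread

# Log-transform only the features whose sample skewness exceeds the cutoff.
features = train.drop(columns="SalePrice")
skewness = features.skew()
skewed_cols = skewness[skewness > SKEW_THRESHOLD].index
features[skewed_cols] = np.log1p(features[skewed_cols])

# Log-transform the target too; np.expm1 maps predictions back to dollars.
y = np.log1p(train["SalePrice"])
restored = np.expm1(y)

print(list(skewed_cols))
print(features.skew())
```

Since log1p(x) is log(1 + x), it is defined at zero, which matters for area-style features that are 0 for many houses.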
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html
> >>>>>>>> - http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations
> >>>>>>>>
> >>>>>>>> https://en.wikipedia.org/wiki/Log-normal_distribution
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> How did you determine the skewness threshold?
> >>>>>>>>
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>> https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Specified_variance:_the_normal_distribution
> >>>>>>>>
> >>>>>>>> https://en.wikipedia.org/wiki/Normalization_(statistics)
> >>>>>>>>
> >>>>>>>> http://scikit-learn.org/stable/modules/preprocessing.html#normalization
> >>>>>>>>
> >>>>>>>
> >>>>>>> - https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network
> >>>>>>> - https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
> >>>>>>>
> >>>>>>
> >>>>>> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> The training and test data sets have different "completeness" of some
> >>>>>>>>> features, and using pd.get_dummies can be problematic when you fit a
> >>>>>>>>> model versus predicting if you don't have the same columns/features.
> >>>>>>>>> I simply combined the train and test data sets (without the Id and
> >>>>>>>>> SalePrice) and ran the get_dummies function over that set.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> autoclean_cv loads the train set first and then applies those
> >>>>>>>> categorical/numerical mappings to the test set
> >>>>>>>> https://github.com/rhiever/datacleaner#datacleaner-in-scripts
> >>>>>>>>
> >>>>>>>> When I modify load_house_prices [1] to also load test.csv in order to
> >>>>>>>> autoclean_cv,
> >>>>>>>> I might try assigning the categorical levels according to the ranking in
> >>>>>>>> data_description.txt,
> >>>>>>>> rather than the happenstance ordering in train.csv;
> >>>>>>>> though get_dummies should make that irrelevant.
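One way the ranked-levels idea could look in pandas, as a sketch only. The quality_order list is typed in by hand here and assumed to run worst to best in the style of data_description.txt (it is not parsed from that file), and the ExterQual values are made up. An ordered categorical gives integer codes that follow the ranking, whereas pd.factorize numbers levels in the order they happen to appear.

```python
import pandas as pd

# Assumed ranking, worst to best (hand-typed for illustration).
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

train = pd.DataFrame({"ExterQual": ["TA", "Gd", "Ex", "TA", "Fa"]})

# Ordered categorical: codes follow the declared ranking.
quality_dtype = pd.CategoricalDtype(categories=quality_order, ordered=True)
ranked_codes = train["ExterQual"].astype(quality_dtype).cat.codes

# For comparison: codes assigned by order of first appearance in the column.
appearance_codes = pd.factorize(train["ExterQual"])[0]

print(ranked_codes.tolist())      # [2, 3, 4, 2, 1]
print(appearance_codes.tolist())  # [0, 1, 2, 0, 3]
```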
> >>>>>>>>
> >>>>>>>> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L45
> >>>>>>>>
> >>>>>>>> I should probably also manually specify that 'Id' is the index column in
> >>>>>>>> pd.read_csv (assuming there are no duplicates, which pandas should check
> >>>>>>>> for).
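A small sketch of the index idea, using an in-memory CSV with made-up rows in place of train.csv. One caveat worth stating: read_csv with index_col does not by itself reject duplicate labels, so the uniqueness check below is explicit.

```python
import io

import pandas as pd

# In-memory stand-in for train.csv (illustrative rows only).
csv_text = "Id,LotArea,SalePrice\n1,8450,208500\n2,9600,181500\n3,11250,223500\n"

# Use the 'Id' column as the row index instead of a regular feature column.
train = pd.read_csv(io.StringIO(csv_text), index_col="Id")

# Check uniqueness explicitly; read_csv accepts duplicate index labels silently.
if not train.index.is_unique:
    raise ValueError("duplicate Id values found")

# A file with a repeated Id still loads without complaint.
dup = pd.read_csv(io.StringIO("Id,LotArea\n1,10\n1,20\n"), index_col="Id")
print(train.index.is_unique, dup.index.is_unique)
```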
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> When I needed to fit the model, I just "unraveled" the combined set with
> >>>>>>>>> the train and test parts.
> >>>>>>>>>
> >>>>>>>>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
> >>>>>>>>>                      test.loc[:,'MSSubClass':'SaleCondition']))
> >>>>>>>>>
> >>>>>>>>> combined = pd.get_dummies(combined)
> >>>>>>>>>
> >>>>>>>>> ::: do some feature engineering :::
> >>>>>>>>>
> >>>>>>>>> trainX = combined[:train.shape[0]]
> >>>>>>>>> y = train['SalePrice']
> >>>>>>>>>
> >>>>>>>>> Just so long as you don't do anything to the combined dataframe (like
> >>>>>>>>> sorting), you can slice off each part based on its shape.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> and when you would be pulling the data to predict the test data, you get
> >>>>>>>>> the other part:
> >>>>>>>>>
> >>>>>>>>> testX = combined[train.shape[0]:]
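The combine, encode, and slice steps above can be sketched end to end on toy frames. The rows and the two-feature column list are made up for illustration; the real data has many more columns. Positional slicing with .iloc is used here because the concatenated frame carries repeated index labels from the two source frames.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (illustrative rows only).
train = pd.DataFrame({
    "Id": [1, 2, 3],
    "MSZoning": ["RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})
test = pd.DataFrame({
    "Id": [4, 5],
    "MSZoning": ["FV", "RL"],   # "FV" never appears in train
    "LotArea": [9550, 14260],
})

# Stack the feature columns (no Id, no SalePrice), train rows first.
feature_cols = ["MSZoning", "LotArea"]
combined = pd.concat((train[feature_cols], test[feature_cols]))

# One encoding pass over both sets, so both get identical dummy columns.
combined = pd.get_dummies(combined)

# Row order is untouched, so the first len(train) rows are the training part.
n_train = train.shape[0]
trainX = combined.iloc[:n_train]
testX = combined.iloc[n_train:]
y = train["SalePrice"]

print(list(combined.columns))
print(trainX.shape, testX.shape)
```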
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Why is the concatenation necessary?
> >>>>>>>> - log1p doesn't need the whole column
> >>>>>>>> - get_dummies doesn't need the whole column
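On the get_dummies point, a short sketch with made-up values: get_dummies only creates columns for the levels present in the frame it is given, so encoding train and test separately can yield different columns. One alternative to concatenating (offered here as an option, not as what anyone in the thread did) is to reindex the test dummies onto the training columns.

```python
import pandas as pd

train = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"]})
test = pd.DataFrame({"MSZoning": ["FV", "RL"]})

# Encoded separately, each frame only gets columns for the levels it contains.
train_d = pd.get_dummies(train, dtype=int)
test_d = pd.get_dummies(test, dtype=int)
print(list(train_d.columns))  # ['MSZoning_RL', 'MSZoning_RM']
print(list(test_d.columns))   # ['MSZoning_FV', 'MSZoning_RL']

# Align test to the training columns: missing columns are filled with 0,
# and columns the model never saw in training are dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))
```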
> >>>>>>>>
> >>>>>>>
> >>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
> >>>>>> requires the whole column.
> >>>>>>
> >>>>>> (http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)
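StandardScaler needs a whole column in the sense that the mean and standard deviation are column-level statistics. A minimal sketch with made-up LotArea values follows; fitting on the training column alone and reusing those statistics on the test column is one common convention, assumed here for illustration rather than taken from the thread.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Single-feature toy columns (illustrative values only).
train_X = np.array([[8450.0], [9600.0], [11250.0], [9550.0]])
test_X = np.array([[14260.0], [7000.0]])

# fit() computes the column mean and standard deviation from the training data.
scaler = StandardScaler().fit(train_X)

# transform() applies those same statistics to any data with matching columns.
train_scaled = scaler.transform(train_X)
test_scaled = scaler.transform(test_X)

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```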
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Luke
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>> (Trimmed reply-chain (again) because 40Kb limit)
> >>>>>>>
> >>>>>>>
> >>>>>> _______________________________________________
> >>>>>> Omaha Python Users Group mailing list
> >>>>>> Omaha at python.org
> >>>>>> https://mail.python.org/mailman/listinfo/omaha
> >>>>>> http://www.OmahaPython.org
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>