[omaha] Group Data Science Competition

Bob Haffner bob.haffner at gmail.com
Wed Dec 28 10:36:17 EST 2016


Leaderboard update:
https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices_leaderboard.ipynb

On Wed, Dec 28, 2016 at 8:53 AM, Luke Schollmeyer via Omaha <omaha at python.org> wrote:

> Nice feature engineering! It really gets into the "buyer mindset" of what
> rewards higher prices and punishes lower ones. Good job using some external
> data.
>
> Luke
>
> On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha <omaha at python.org> wrote:
>
> > Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our
> > score! Currently sitting at 798th place.
> >
> > My notebook is on GitHub for those interested:
> >
> > https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4
> >
> > Jeremy
> >
> >
> > > On Dec 25, 2016, at 9:41 PM, Wes Turner via Omaha <omaha at python.org> wrote:
> > >
> > >> On Sun, Dec 25, 2016 at 7:40 PM, Wes Turner <wes.turner at gmail.com> wrote:
> > >>
> > >>
> > >>
> > >>> On Sunday, December 25, 2016, Bob Haffner <bob.haffner at gmail.com> wrote:
> > >>>
> > >>> Merry Christmas, everyone!
> > >>>
> > >>>
> > >> Merry Christmas!
> > >>
> > >>>
> > >>> Still heading down the TPOT path with limited success. I get varying
> > >>> scores (tpot.score()) with the same result (Kaggle scoring).
> > >>>
> > >>> Any other TPOT users getting inconsistent results? Specifically with
> > >>> 0.6.7?
> > >>>
> > >>
> > >> There may be variance because of the way TPOT splits X_train into
> > >> X_train and X_test w/ train_size and test_size.
> > >>
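> > >> Pinning the seeds should at least make runs comparable. A minimal,
> > >> untested sketch (assuming TPOT 0.6's TPOTRegressor accepts random_state,
> > >> and using numeric columns only to keep it small):
> > >>
> > >> import pandas as pd
> > >> from sklearn.model_selection import train_test_split
> > >> from tpot import TPOTRegressor
> > >>
> > >> train = pd.read_csv('train.csv', index_col='Id')
> > >> X = train.select_dtypes(include=['number']).drop('SalePrice', axis=1)
> > >> X = X.fillna(0)
> > >> y = train['SalePrice']
> > >> # fix random_state on both the split and the TPOT search
> > >> X_train, X_test, y_train, y_test = train_test_split(
> > >>     X, y, train_size=0.75, test_size=0.25, random_state=42)
> > >> tpot = TPOTRegressor(generations=5, population_size=20,
> > >>                      random_state=42, verbosity=2)
> > >> tpot.fit(X_train, y_train)
> > >> print(tpot.score(X_test, y_test))
> > >>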
> > >> I rewrote load_house_prices (in data.py) as a class with a better McCabe
> > >> cyclomatic complexity score and a concatenation step, so that X_train
> > >> and X_test have the same columns.
> > >>
> > >> It probably makes sense to use scikit-learn for data transformation
> > >> (e.g. OneHotEncoder instead of get_dummies).
> > >>
> > >> https://twitter.com/westurner/status/813011289475842048 :
> > >> """
> > >> . at scikit_learn
> > >> Src: https://t.co/biMt6XRt2T
> > >> Docs: https://t.co/Lb5EYRCdI8
> > >> #API:
> > >> .fit_transform(X, y)
> > >> .fit(X_train, y_train)
> > >> .predict(X_test)
> > >> """
> > >>
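> > >> e.g., fit the encoder on train only and reuse it on test so the columns
> > >> always line up (a sketch; recent scikit-learn OneHotEncoder accepts
> > >> string categories directly, older versions need LabelEncoder first):
> > >>
> > >> from sklearn.preprocessing import OneHotEncoder
> > >>
> > >> cat_cols = X_train.select_dtypes(include=['object']).columns
> > >> # handle_unknown='ignore' zero-fills levels that only appear in test
> > >> enc = OneHotEncoder(handle_unknown='ignore')
> > >> enc.fit(X_train[cat_cols])
> > >> X_train_ohe = enc.transform(X_train[cat_cols])  # scipy sparse matrix
> > >> X_test_ohe = enc.transform(X_test[cat_cols])    # same columns as train
> > >>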
> > >> I haven't yet run w/ pd.get_dummies and df.to_sparse; that may or may
> > >> not prevent the OOM error.
> > >>
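> > >> Something like this, untested (to_sparse is the pandas API as of 0.19;
> > >> combined is the concatenated train+test frame from Luke's snippet
> > >> below):
> > >>
> > >> import pandas as pd
> > >>
> > >> dummies = pd.get_dummies(combined)        # dense, potentially huge
> > >> sparse = dummies.to_sparse(fill_value=0)  # dummy columns are mostly 0
> > >> print(sparse.density)  # fraction of values that aren't the fill_value
> > >>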
> > >> Looking at the "Dependent Repositories" on
> > >> https://libraries.io/pypi/xgboost, there are a number of
> > >> scikit-learn-compatible packages for automating analysis in addition to
> > >> TPOT: auto-sklearn, rep.
> > >> auto_ml mentions 12 algos for type_of_estimator='regressor'
> > >> (and sparse matrices, and other parameters).
> > >>
> > >> https://github.com/ClimbsRocks/auto_ml
> > >>
> > >> http://auto-ml.readthedocs.io/en/latest/
> > >>
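> > >> If I'm reading the auto_ml README right, usage is roughly (a sketch;
> > >> the column_descriptions values come from their docs, and any column not
> > >> listed is treated as numeric):
> > >>
> > >> import pandas as pd
> > >> from auto_ml import Predictor
> > >>
> > >> df_train = pd.read_csv('train.csv', index_col='Id')
> > >> df_test = pd.read_csv('test.csv', index_col='Id')
> > >> column_descriptions = {
> > >>     'SalePrice': 'output',
> > >>     'MSZoning': 'categorical',  # ...and the other object columns
> > >> }
> > >> ml_predictor = Predictor(type_of_estimator='regressor',
> > >>                          column_descriptions=column_descriptions)
> > >> ml_predictor.train(df_train)
> > >> predictions = ml_predictor.predict(df_test)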
> > >
> > > Here's a (probably overfitted) auto_ml attempt:
> > > https://github.com/westurner/house_prices/blob/7260ada0c10cf371b33973b0d9af6bca860d0008/house_prices/analysis_auto_ml.py
> > >
> > > https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3958857
> > > ..."Your submission scored 9.45422, which is not an improvement of your
> > > best score."
> > >
> > > Setting .train(compute_power=10) errored out after a bunch of
> > > GridSearchCV runs.
> > >
> > >
> > >>
> > >>
> > >> I should be able to generate column_descriptions from parse_description
> > >> in data.py:
> > >> https://github.com/westurner/house_prices/blob/develop/house_prices/data.py
> > >>
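> > >> A first pass could just map dtypes (a hypothetical helper; the real
> > >> ordinal rankings would come from parse_description and
> > >> data_description.txt):
> > >>
> > >> def guess_column_descriptions(df, output='SalePrice'):
> > >>     """Guess auto_ml column_descriptions from dtypes."""
> > >>     descriptions = {output: 'output'}
> > >>     for col in df.columns:
> > >>         if df[col].dtype == object:
> > >>             descriptions[col] = 'categorical'
> > >>     return descriptions
> > >>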
> > >> https://github.com/automl/auto-sklearn looks cool too.
> > >>
> > >> ... http://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
> > >>
> > >> http://tflearn.org
> > >>
> > >>
> > >>>
> > >>>
> > >>> On Fri, Dec 23, 2016 at 8:03 AM, Luke Schollmeyer <luke.schollmeyer at gmail.com> wrote:
> > >>>
> > >>>> Moved the needle a little bit yesterday with a ridge regression
> > >>>> attempt using the same feature engineering I described before.
> > >>>>
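> > >>>> Roughly this, for anyone who wants to reproduce it (a sketch using
> > >>>> the trainX/testX/y from the get_dummies snippet quoted below;
> > >>>> alpha=10 is a stand-in, not my tuned value):
> > >>>>
> > >>>> import numpy as np
> > >>>> from sklearn.linear_model import Ridge
> > >>>>
> > >>>> # fit on log-transformed prices; Kaggle scores RMSE of log(SalePrice)
> > >>>> model = Ridge(alpha=10.0)
> > >>>> model.fit(trainX, np.log1p(y))
> > >>>> # invert log1p for the submission file
> > >>>> predictions = np.expm1(model.predict(testX))
> > >>>>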
> > >>>> Luke
> > >>>>
> > >>>> On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner <bob.haffner at gmail.com> wrote:
> > >>>>
> > >>>>> Made a TPOT attempt tonight. Could only do some numeric features
> > >>>>> though, because including categoricals would cause my IPython kernel
> > >>>>> to die.
> > >>>>>
> > >>>>> I will try a bigger box this weekend.
> > >>>>>
> > >>>>> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha <omaha at python.org> wrote:
> > >>>>>
> > >>>>>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner <wes.turner at gmail.com> wrote:
> > >>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner <wes.turner at gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer <luke.schollmeyer at gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> The quick explanation is that rather than dropping outliers, I
> > >>>>>>>>> used numpy's log1p function to help normalize the distribution of
> > >>>>>>>>> the data (for both the sale price and for all features over a
> > >>>>>>>>> certain skewness). I was also struggling with adding in more
> > >>>>>>>>> features to the model.
> > >>>>>>>>>
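> > >>>>>>>>> In code, that's roughly (a sketch; the 0.75 cutoff is
> > >>>>>>>>> illustrative, not necessarily the threshold I used):
> > >>>>>>>>>
> > >>>>>>>>> import numpy as np
> > >>>>>>>>> from scipy.stats import skew
> > >>>>>>>>>
> > >>>>>>>>> numeric = combined.dtypes[combined.dtypes != object].index
> > >>>>>>>>> # skewness per numeric feature, ignoring NaNs
> > >>>>>>>>> skewness = combined[numeric].apply(lambda s: skew(s.dropna()))
> > >>>>>>>>> skewed = skewness[skewness > 0.75].index
> > >>>>>>>>> combined[skewed] = np.log1p(combined[skewed])
> > >>>>>>>>> # the target gets the same treatment
> > >>>>>>>>> train['SalePrice'] = np.log1p(train['SalePrice'])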
> > >>>>>>>>
> > >>>>>>>> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html
> > >>>>>>>> - http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations
> > >>>>>>>>
> > >>>>>>>> https://en.wikipedia.org/wiki/Log-normal_distribution
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> How did you determine the skewness threshold?
> > >>>>>>>>
> > >>>>>>>> ...
> > >>>>>>>>
> > >>>>>>>> https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Specified_variance:_the_normal_distribution
> > >>>>>>>>
> > >>>>>>>> https://en.wikipedia.org/wiki/Normalization_(statistics)
> > >>>>>>>>
> > >>>>>>>> http://scikit-learn.org/stable/modules/preprocessing.html#normalization
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> - https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network
> > >>>>>>> - https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
> > >>>>>>>
> > >>>>>>
> > >>>>>> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn
> > >>>>>>
> > >>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> The training and test data sets have different "completeness" of
> > >>>>>>>>> some features, and pd.get_dummies can be problematic if the data
> > >>>>>>>>> you fit a model on doesn't end up with the same columns/features
> > >>>>>>>>> as the data you predict on. I simply combined the train and test
> > >>>>>>>>> data sets (without the Id and SalePrice) and ran the get_dummies
> > >>>>>>>>> function over that set.
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> autoclean_cv loads the train set first and then applies those
> > >>>>>>>> categorical/numerical mappings to the test set
> > >>>>>>>> https://github.com/rhiever/datacleaner#datacleaner-in-scripts
> > >>>>>>>>
> > >>>>>>>> When I modify load_house_prices [1] to also load test.csv in
> > >>>>>>>> order to autoclean_csv, I might try assigning the categorical
> > >>>>>>>> levels according to the ranking in data_description.txt, rather
> > >>>>>>>> than the happenstance ordering in train.csv; though get_dummies
> > >>>>>>>> should make that irrelevant.
> > >>>>>>>>
> > >>>>>>>> [1] https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L45
> > >>>>>>>>
> > >>>>>>>> I should probably also manually specify that 'Id' is the index
> > >>>>>>>> column in pd.read_csv (assuming there are no duplicates, which
> > >>>>>>>> pandas should check for).
> > >>>>>>>>
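> > >>>>>>>> e.g. (a sketch):
> > >>>>>>>>
> > >>>>>>>> import pandas as pd
> > >>>>>>>>
> > >>>>>>>> train = pd.read_csv('train.csv', index_col='Id')
> > >>>>>>>> assert train.index.is_unique  # no duplicate Ids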
> > >>>>>>>>
> > >>>>>>>>> When I needed to fit the model, I just "unraveled" the combined
> > >>>>>>>>> set into the train and test parts.
> > >>>>>>>>>
> > >>>>>>>>> combined = pd.concat((train.loc[:, 'MSSubClass':'SaleCondition'],
> > >>>>>>>>>                       test.loc[:, 'MSSubClass':'SaleCondition']))
> > >>>>>>>>>
> > >>>>>>>>> combined = pd.get_dummies(combined)
> > >>>>>>>>>
> > >>>>>>>>> ::: do some feature engineering :::
> > >>>>>>>>>
> > >>>>>>>>> trainX = combined[:train.shape[0]]
> > >>>>>>>>> y = train['SalePrice']
> > >>>>>>>>>
> > >>>>>>>>> As long as you don't do anything to the combined dataframe (like
> > >>>>>>>>> sorting), you can slice off each part based on its shape.
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> and when you need to pull the data to predict on the test data,
> > >>>>>>>>> you take the other part:
> > >>>>>>>>>
> > >>>>>>>>> testX = combined[train.shape[0]:]
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Why is the concatenation necessary?
> > >>>>>>>> - log1p doesn't need the combined (train + test) column
> > >>>>>>>> - get_dummies doesn't need the combined (train + test) column
> > >>>>>>>>
> > >>>>>>>
> > >>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
> > >>>>>> requires the whole column.
> > >>>>>>
> > >>>>>> (http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)
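> > >>>>>>
> > >>>>>> i.e. the scaler's mean_/scale_ come only from the rows it was fit
> > >>>>>> on; a sketch of the usual pattern:
> > >>>>>>
> > >>>>>> from sklearn.preprocessing import StandardScaler
> > >>>>>>
> > >>>>>> scaler = StandardScaler()
> > >>>>>> trainX_scaled = scaler.fit_transform(trainX)  # learns mean/std
> > >>>>>> testX_scaled = scaler.transform(testX)        # reuses train stats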
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Luke
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>> (Trimmed reply-chain (again) because of the 40 KB limit)
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> >
> _______________________________________________
> Omaha Python Users Group mailing list
> Omaha at python.org
> https://mail.python.org/mailman/listinfo/omaha
> http://www.OmahaPython.org
>


More information about the Omaha mailing list