[omaha] Group Data Science Competition

Wes Turner wes.turner at gmail.com
Wed Dec 28 13:01:19 EST 2016


On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha <omaha at python.org>
wrote:

> Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our
> score! Currently sitting at 798th place.
>

Nice work! Features of your feature engineering I admire:

- nominal, ordinal, continuous, discrete
  categorical = nominal + discrete
  numeric = continuous + discrete

- outlier removal
  - [ ] w/ constant thresholding? (is there a distribution parameter?)

- building datestrings from SaleMonth and YrSold
  - SaleMonth / "1" / YrSold
    - df.drop(['MoSold','YrSold','SaleMonth'])
      - [ ] why drop SaleMonth?
  - [ ] pandas.to_datetime(df['SaleMonth']) (sketch below)
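
A minimal sketch of that to_datetime step (MoSold/YrSold are the train.csv
column names; 'SaleDate' is just an illustrative name here):

    import pandas as pd

    df = pd.read_csv('train.csv', index_col='Id')
    # build "month/1/year" strings, then parse them into real datetimes
    sale_date = df['MoSold'].astype(str) + '/1/' + df['YrSold'].astype(str)
    df['SaleDate'] = pd.to_datetime(sale_date, format='%m/%d/%Y')
    # the raw integer columns can then be dropped
    df = df.drop(['MoSold', 'YrSold'], axis=1)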

- merging with the FHFA House Price Index for the month and region ("West
  North Central")

https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_PO_monthly_hist.xls
  - [ ] pandas.to_datetime
    - this should have every month, but the new merge_asof feature is worth
      mentioning (hedged sketch below)
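
Since merge_asof came up, a hedged sketch (pandas >= 0.19); the HPI filename
and 'Date' column name here are assumptions, and because the HPI file is
monthly an exact join on the month would work just as well:

    import pandas as pd

    # both sides must be sorted on the join key
    sales = df[['SaleDate']].sort_values('SaleDate')
    hpi = pd.read_csv('hpi_west_north_central.csv',
                      parse_dates=['Date']).sort_values('Date')
    # match each sale to the most recent HPI observation at or before it
    merged = pd.merge_asof(sales, hpi, left_on='SaleDate', right_on='Date')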

- manual binarization
  - [ ] how did you pick these? correlation after pd.get_dummies? (one
    possible approach sketched below)
  - [ ] why floats? 1.0 / 1 (does it make a difference?)
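
One answer to "how did you pick these" that I'd try (my assumption, not
necessarily what the notebook does): one-hot encode the nominal columns and
rank the dummies by absolute correlation with SalePrice. As for 1.0 vs 1, it
shouldn't matter; scikit-learn converts inputs to float internally.

    import pandas as pd

    # 'train' and 'nominal' (a list of column names) are assumed to exist
    dummies = pd.get_dummies(train[nominal], dummy_na=False)
    corr_with_price = dummies.apply(lambda col: col.corr(train['SalePrice']))
    print(corr_with_price.abs().sort_values(ascending=False).head(20))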

- Ames, IA nbrhood_multiplier
  - http://www.cityofames.org/home/showdocument?id=1024

- feature merging
  - BsmtFinSF = BsmtFinSF1 + BsmtFinSF2
  - TotalBaths = BsmtFullBath + (BsmtHalfBath / 2.0) + FullBath
    + (HalfBath / 2.0)
  - ( ) IDK how a feature-selection pipeline could do this automatically
    (pandas transcription below)
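
The pandas transcription of those two formulas, for reference (assumes the
raw columns are numeric and NaNs have already been filled):

    df['BsmtFinSF'] = df['BsmtFinSF1'] + df['BsmtFinSF2']
    df['TotalBaths'] = (df['BsmtFullBath'] + df['BsmtHalfBath'] / 2.0
                        + df['FullBath'] + df['HalfBath'] / 2.0)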

- null value imputation
  - fill with 0 where .isnull() (sketch below)
  - ( ) datacleaner incorrectly sets these to median or mode
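
A sketch of the zero-fill, assuming the NA-means-absent columns are listed by
hand (the column list below is illustrative; pick it from
data_description.txt), which is exactly what a generic median/mode imputer
can't know:

    # columns where NaN means "feature not present", so 0 is the right fill
    absent_means_zero = ['MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
                         'GarageArea']
    df[absent_means_zero] = df[absent_means_zero].fillna(0)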

- log for skewed continuous features and SalePrice (sketch below)
  - ( ) auto_ml: take_log_of_y does this for SalePrice
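
A sketch with np.log1p (safe at zero); the 0.75 skew cutoff is illustrative,
not something from the notebook:

    import numpy as np

    # log-transform the target (np.expm1 undoes this later)
    y = np.log1p(train['SalePrice'])
    X = train.drop('SalePrice', axis=1)

    # log1p any numeric feature whose skew exceeds the threshold
    numeric_cols = X.select_dtypes(include=[np.number]).columns
    skewness = X[numeric_cols].skew()
    skewed_cols = skewness[skewness > 0.75].index
    X[skewed_cols] = np.log1p(X[skewed_cols])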

- "Keeping only the columns we want"
  - [ ] 'Id' shouldn't be relevant (pd.read_csv(filename, index_col='Id'))


- Binarization
  - pd.get_dummies(dummy_na=False)
  - [ ] (as Luke pointed out, concatenation keeps the same columns)
        rows = eng_train.shape[0]
        eng_merged = pd.concat([eng_train, eng_test])
        onehot_merged = pd.get_dummies(eng_merged, columns=nominal,
                                       dummy_na=False)
        onehot_train = onehot_merged[:rows]
        onehot_test = onehot_merged[rows:]

- class RandomSelectionHelper
  - [ ] this could be generally helpful in sklearn[-pandas] (sketch below)
    - https://github.com/paulgb/sklearn-pandas#cross-validation
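
For anyone who wants the same thing without the helper class, a generic
sketch of the idea (my code, not the notebook's RandomSelectionHelper;
X_train/y_train and the parameter grids are assumptions): loop a dict of
estimators and grids through GridSearchCV and keep the best score and params
per model.

    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    models = {
        'ridge': (Ridge(), {'alpha': [0.1, 1.0, 10.0]}),
        'lasso': (Lasso(), {'alpha': [0.0005, 0.001, 0.01]}),
        'enet': (ElasticNet(), {'alpha': [0.001, 0.01],
                                'l1_ratio': [0.2, 0.5, 0.8]}),
    }
    results = {}
    for name, (est, grid) in models.items():
        gs = GridSearchCV(est, grid, scoring='neg_mean_squared_error', cv=5)
        gs.fit(X_train, y_train)
        results[name] = (gs.best_score_, gs.best_params_)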

- Models to Search
  - {Ridge, Lasso, ElasticNet}

     -
https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L222
       _get_estimator_names ( "regressor" )
       - {XGBRegressor, GradientBoostingRegressor, RANSACRegressor,
RandomForestRegressor, LinearRegression, AdaBoostRegressor,
ExtraTreesRegressor}

     -
https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L491
       - (w/ ensembling)
       - ['RandomForestRegressor', 'LinearRegression', 'ExtraTreesRegressor',
         'Ridge', 'GradientBoostingRegressor', 'AdaBoostRegressor', 'Lasso',
         'ElasticNet', 'LassoLars', 'OrthogonalMatchingPursuit',
         'BayesianRidge', 'SGDRegressor'] + ['XGBRegressor']

- model stacking / ensembling (simple averaging sketch below)

  - ( ) auto_ml: https://auto-ml.readthedocs.io/en/latest/ensembling.html
  - ( ) auto-sklearn:

https://automl.github.io/auto-sklearn/stable/api.html#autosklearn.regression.AutoSklearnRegressor
        ensemble_size=50, ensemble_nbest=50
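
Those libraries handle the ensembling; for comparison, a hand-rolled blend is
just a weighted average of base-model predictions (sketch only; the
'fitted_models' dict and the weights are made up):

    # average predictions from several already-fit regressors
    preds = {name: est.predict(X_test) for name, est in fitted_models.items()}
    weights = {'ridge': 0.4, 'lasso': 0.3, 'xgb': 0.3}  # tune on a holdout
    blended = sum(weights[name] * preds[name] for name in weights)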

- submission['SalePrice'] = submission.SalePrice.apply(lambda x: np.exp(x))

  - [ ] What is this called / how does this work? (see note below)
    - https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
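
Answering my own question: it is presumably just the inverse (exponential
back-transform) of the log applied to SalePrice above, mapping log-scale
predictions back to dollars; if np.log1p was used on the target, np.expm1 is
the exact inverse:

    import numpy as np

    # if the target was np.log(SalePrice):
    submission['SalePrice'] = np.exp(submission['SalePrice'])
    # if it was np.log1p(SalePrice), undo with expm1 instead:
    # submission['SalePrice'] = np.expm1(submission['SalePrice'])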

- df.to_csv(filename, columns=['SalePrice'], index_label='Id') also works
  -
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html



> My notebook is on GitHub for those interested:
>
> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4


Thanks!


>
>
> Jeremy
>
>
> > On Dec 25, 2016, at 9:41 PM, Wes Turner via Omaha <omaha at python.org>
> wrote:
> >
> >> On Sun, Dec 25, 2016 at 7:40 PM, Wes Turner <wes.turner at gmail.com>
> wrote:
> >>
> >>
> >>
> >>> On Sunday, December 25, 2016, Bob Haffner <bob.haffner at gmail.com>
> wrote:
> >>>
> >>> Merry Christmas, everyone!
> >>>
> >>>
> >> Merry Christmas!
> >>
> >>>
> >>> Still heading down the TPOT path with limited success.  I get varying
> >>> scores (tpot.score()) with the same result (kaggle scoring)
> >>>
> >>> Any other TPOT users getting inconsistent results?   Specifically with
> >>> 0.6.7?
> >>>
> >>
> >> There may be variance because of the way TPOT splits X_train into
> X_train
> >> and X_test w/ train_size and test_size.
> >>
> >> I rewrote load_house_prices as a class w/ a better mccabe cyclomatic
> >> complexity score with a concatenation step so that X_train and X_test
> have
> >> the same columns (in data.py)
> >>
> >> It probably makes sense to use scikit-learn for data transformation
> (e.g.
> >> OneHotEncoder instead of get_dummies).
> >>
> >> https://twitter.com/westurner/status/813011289475842048 :
> >> """
> >> . at scikit_learn
> >> Src: https://t.co/biMt6XRt2T
> >> Docs: https://t.co/Lb5EYRCdI8
> >> #API:
> >> .fit_transform(X, y)
> >> .fit(X_train, y_train)
> >> .predict(X_test)
> >> """
> >>
> >> I haven't yet run w/ pd.get_dummies and df.to_sparse; that may or may
> not
> >> prevent the oom error.
> >>
> >> Looking at https://libraries.io/pypi/xgboost "Dependent Repositories",
> >> there are a number of scikit-learn-compatible packages for automating
> >> analysis in addition to TPOT: auto-sklearn, rep.
> >> auto_ml mentions 12 algos for type_of_estimator='regressor'.
> >> (and sparse matrices, and other parameters).
> >>
> >> https://github.com/ClimbsRocks/auto_ml
> >>
> >> http://auto-ml.readthedocs.io/en/latest/
> >>
> >
> > Here's a (probably overfitted) auto_ml attempt:
> > https://github.com/westurner/house_prices/blob/
> 7260ada0c10cf371b33973b0d9af6bca860d0008/house_prices/analysis_auto_ml.py
> >
> > https://www.kaggle.com/c/house-prices-advanced-regression-techniques/
> leaderboard?submissionId=3958857
> > ..."Your submission scored 9.45422, which is not an improvement of your
> > best score. "
> >
> > Setting .train(compute_power=10) errored out after a bunch of
> GridSearchCV.
> >
> >
> >>
> >>
> >> I should be able to generate column_descriptions from parse_description
> in
> >> data.py:
> >> https://github.com/westurner/house_prices/blob/develop/
> >> house_prices/data.py
> >>
> >> https://github.com/automl/auto-sklearn looks cool too.
> >>
> >> ... http://stats.stackexchange.com/questions/181/how-to-
> choose-the-number-
> >> of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
> >>
> >> http://tflearn.org
> >>
> >>
> >>>
> >>>
> >>> On Fri, Dec 23, 2016 at 8:03 AM, Luke Schollmeyer <
> >>> luke.schollmeyer at gmail.com> wrote:
> >>>
> >>>> Moved the needle a little bit yesterday with a ridge regression
> attempt
> >>>> using the same feature engineering I described before.
> >>>>
> >>>> Luke
> >>>>
> >>>> On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner <bob.haffner at gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Made a TPOT attempt tonight.  Could only do some numeric features
> >>>>> though because including categoricals would cause my ipython kernel
> to die.
> >>>>>
> >>>>> I will try a bigger box this weekend
> >>>>>
> >>>>> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha <
> omaha at python.org
> >>>>>> wrote:
> >>>>>
> >>>>>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner <wes.turner at gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner <wes.turner at gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer <
> >>>>>>>> luke.schollmeyer at gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> The quick explanation is rather than dropping outliers, I used
> >>>>>> numpy's
> >>>>>>>>> log1p function to help normalize distribution of the data (for
> >>>>>> both the
> >>>>>>>>> sale price and for all features over a certain skewness). I was
> >>>>>> also
> >>>>>>>>> struggling with adding in more features to the model.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> https://docs.scipy.org/doc/numpy/reference/generated/numpy.l
> >>>>>> og1p.html
> >>>>>>>> - http://scikit-learn.org/stable/modules/generated/sklearn.
> >>>>>>>> preprocessing.FunctionTransformer.html
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> https://en.wikipedia.org/wiki/Data_transformation_(statistic
> >>>>>>>> s)#Common_transformations
> >>>>>>>>
> >>>>>>>> https://en.wikipedia.org/wiki/Log-normal_distribution
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> How did you determine the skewness threshold?
> >>>>>>>>
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>> https://en.wikipedia.org/wiki/Maximum_entropy_probability_di
> >>>>>>>> stribution#Specified_variance:_the_normal_distribution
> >>>>>>>>
> >>>>>>>> https://en.wikipedia.org/wiki/Normalization_(statistics)
> >>>>>>>>
> >>>>>>>> http://scikit-learn.org/stable/modules/preprocessing.html#no
> >>>>>> rmalization
> >>>>>>>>
> >>>>>>>
> >>>>>>> - https://stackoverflow.com/questions/4674623/why-do-we-
> >>>>>>> have-to-normalize-the-input-for-an-artificial-neural-network
> >>>>>>> - https://stats.stackexchange.com/questions/7757/data-normaliz
> >>>>>> ation-and-
> >>>>>>> standardization-in-neural-networks
> >>>>>>>
> >>>>>>
> >>>>>> https://github.com/tensorflow/tensorflow/tree/master/tensorf
> >>>>>> low/contrib/learn/python/learn
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> The training and test data sets have different "completeness" of
> >>>>>> some
> >>>>>>>>> features, and using pd.get_dummies can be problematic when you
> fit
> >>>>>> a model
> >>>>>>>>> versus predicting if you don't have the same columns/features. I
> >>>>>> simply
> >>>>>>>>> combined the train and test data sets (without the Id and
> >>>>>> SalePrice) and
> >>>>>>>>> ran the get_dummies function over that set.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> autoclean_cv loads the train set first and then applies those
> >>>>>>>> categorical/numerical mappings to the test set
> >>>>>>>> https://github.com/rhiever/datacleaner#datacleaner-in-scripts
> >>>>>>>>
> >>>>>>>> When I modify load_house_prices [1] to also load test.csv in order
> >>>>>> to
> >>>>>>>> autoclean_csv,
> >>>>>>>> I might try assigning the categorical levels according to the
> >>>>>> ranking in
> >>>>>>>> data_description.txt,
> >>>>>>>> rather than the happenstance ordering in train.csv;
> >>>>>>>> though get_dummies should make that irrelevant.
> >>>>>>>>
> >>>>>>>> https://github.com/westurner/house_prices/blob/2839ff8a/hous
> >>>>>>>> e_prices/data.py#L45
> >>>>>>>>
> >>>>>>>> I should probably also manually specify that 'Id' is the index
> >>>>>> column in
> >>>>>>>> pd.read_csv (assuming there are no duplicates, which pandas should
> >>>>>> check
> >>>>>>>> for).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> When I needed to fit the model, I just "unraveled" the combined
> >>>>>> set with
> >>>>>>>>> the train and test parts.
> >>>>>>>>>
> >>>>>>>>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
> >>>>>>>>>                      test.loc[:,'MSSubClass':'SaleCondition']))
> >>>>>>>>>
> >>>>>>>>> combined = pd.get_dummies(combined)
> >>>>>>>>>
> >>>>>>>>> ::: do some feature engineering :::
> >>>>>>>>>
> >>>>>>>>> trainX = combined[:train.shape[0]]
> >>>>>>>>> y = train['SalePrice']
> >>>>>>>>>
> >>>>>>>>> Just so long you don't do anything to the combined dataframe
> (like
> >>>>>>>>> sorting), you can slice off each part based on its shape.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> http://pandas.pydata.org/pandas-docs/stable/indexing.html#
> >>>>>>>> returning-a-view-versus-a-copy
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> and when you would be pulling the data to predict the test data,
> >>>>>> you get
> >>>>>>>>> the other part:
> >>>>>>>>>
> >>>>>>>>> testX = combined[train.shape[0]:]
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Why is the concatenation necessary?
> >>>>>>>> - log1p doesn't need the whole column
> >>>>>>>> - get_dummies doesn't need the whole column
> >>>>>>>>
> >>>>>>>
> >>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.pre
> >>>>>> processing.StandardScaler.html
> >>>>>> requires the whole column.
> >>>>>>
> >>>>>> (
> >>>>>> http://scikit-learn.org/stable/modules/preprocessing.html#pr
> >>>>>> eprocessing-scaler
> >>>>>> )
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Luke
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>> (Trimmed reply-chain (again) because 40Kb limit)
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
>
> _______________________________________________
> Omaha Python Users Group mailing list
> Omaha at python.org
> https://mail.python.org/mailman/listinfo/omaha
> http://www.OmahaPython.org
>

