[omaha] Group Data Science Competition

Bob Haffner bob.haffner at gmail.com
Tue Jan 3 20:51:05 EST 2017


Pretty interesting notebook I put together regarding the kaggle comp
https://github.com/bobhaffner/kaggle-houseprices/blob/master/additional_training_data.ipynb

On Mon, Jan 2, 2017 at 12:10 AM, Wes Turner via Omaha <omaha at python.org>
wrote:

> On Wednesday, December 28, 2016, Wes Turner <wes.turner at gmail.com> wrote:
>
> >
> >
> > On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha <
> > omaha at python.org> wrote:
> >
> >> Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our
> >> score! Currently sitting at 798th place.
> >>
> >
> > Nice work! Features of your feature engineering I admire:
> >
> > - nominal, ordinal, continuous, discrete
> >   categorical = nominal + discrete
> >   numeric = continuous + discrete
> >
> > - outlier removal
> >   - [ ] w/ constant thresholding? (is there a distribution parameter)
> >
> > - building datestrings from SaleMonth and YrSold
> >   - SaleMonth / "1" / YrSold
> >    - df.drop(['MoSold','YrSold','SaleMonth'], axis=1)
> >      - [ ] why drop SaleMonth?
> >   - [ ] pandas.to_datetime(df['SaleMonth'])
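A minimal sketch of that datestring trick (column names from the Kaggle house-prices data; the exact notebook code may differ):

```python
import pandas as pd

# Toy frame with the month/year columns from the Kaggle data
df = pd.DataFrame({"MoSold": [2, 7, 11], "YrSold": [2008, 2009, 2010]})

# Build a "month/1/year" string and parse it into a proper datetime
df["SaleDate"] = pd.to_datetime(
    df["MoSold"].astype(str) + "/1/" + df["YrSold"].astype(str)
)

# Drop the raw components once the datetime column exists
df = df.drop(columns=["MoSold", "YrSold"])
```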
> >
> > - merging with FHA Home Price Index for the month and region ("West North
> > Central")
> >   https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_PO_monthly_hist.xls
> >   - [ ] pandas.to_datetime
> >     - this should have every month, but the new merge_asof feature is
> > worth mentioning
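For reference, merge_asof (new in pandas 0.19) joins each left row to the most recent right-hand key at or before it; both frames must be sorted on their keys. A sketch with invented HPI numbers:

```python
import pandas as pd

# Monthly FHFA House Price Index values (made-up numbers for illustration)
hpi = pd.DataFrame({
    "date": pd.to_datetime(["2008-01-01", "2008-02-01", "2008-03-01"]),
    "hpi": [200.1, 201.5, 199.8],
})

sales = pd.DataFrame({
    "SaleDate": pd.to_datetime(["2008-02-15", "2008-03-02"]),
    "SalePrice": [150000, 180000],
})

# Each sale picks up the latest HPI observation at or before its date
merged = pd.merge_asof(sales.sort_values("SaleDate"),
                       hpi.sort_values("date"),
                       left_on="SaleDate", right_on="date")
```

With every month present an exact merge on a month key works too; merge_asof just tolerates gaps.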
> >
> > - manual binarization
> >   - [ ] how did you pick these? correlation after pd.get_dummies?
> >   - [ ] why floats? 1.0 / 1 (does it make a difference?)
> >
> > - Ames, IA nbrhood_multiplier
> >   - http://www.cityofames.org/home/showdocument?id=1024
> >
> > - feature merging
> >   - BsmtFinSF = BsmtFinSF1 + BsmtFinSF2
> >   - TotalBaths = BsmtFullBath + (BsmtHalfBath / 2.0) + FullBath +
> > (HalfBath / 2.0)
> >   - ( ) IDK how a feature-selection pipeline could do this automatically
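Those two merges are one-liners in pandas (toy values, column names from the competition data):

```python
import pandas as pd

# Toy frame with the basement/bath columns being combined (values invented)
df = pd.DataFrame({
    "BsmtFinSF1": [500, 0], "BsmtFinSF2": [100, 250],
    "BsmtFullBath": [1, 0], "BsmtHalfBath": [0, 1],
    "FullBath": [2, 1], "HalfBath": [1, 0],
})

# Collapse related columns into single engineered features
df["BsmtFinSF"] = df["BsmtFinSF1"] + df["BsmtFinSF2"]
df["TotalBaths"] = (df["BsmtFullBath"] + df["BsmtHalfBath"] / 2.0
                    + df["FullBath"] + df["HalfBath"] / 2.0)
```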
> >
> > - null value imputation
> >   - .fillna(0)
> >   - ( ) datacleaner incorrectly sets these to median or mode
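The point about datacleaner, sketched: in this dataset many NaNs mean "feature absent" (e.g. no basement), so zero is the right fill, not the column median that a generic cleaner substitutes:

```python
import pandas as pd

# NaN here means "house has no finished basement", so 0 is correct;
# a median fill would invent basement area that doesn't exist
df = pd.DataFrame({"BsmtFinSF1": [500.0, None, 250.0]})
df["BsmtFinSF1"] = df["BsmtFinSF1"].fillna(0)
```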
> >
> > - log for skewed continuous and SalePrice
> >   - ( ) auto_ml: take_log_of_y does this for SalePrice
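A sketch of the log transform (the notebook may use plain np.log/np.exp; log1p/expm1 is the zero-safe variant):

```python
import numpy as np
import pandas as pd

# log1p compresses the right tail of skewed features such as SalePrice;
# expm1 maps model output back to the original price scale
prices = pd.Series([100000.0, 150000.0, 1000000.0])
log_prices = np.log1p(prices)
restored = np.expm1(log_prices)
```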
> >
> > - "Keeping only the columns we want"
> >   - [ ] 'Id' shouldn't be relevant (pd.read_csv(filename, index_col='Id'))
> >
> >
> > - Binarization
> >   - pd.get_dummies(dummy_na=False)
> >   - [ ] (as Luke pointed out, concatenation keeps the same columns)
> >         rows = eng_train.shape[0]
> >         eng_merged = pd.concat([eng_train, eng_test])
> >         onehot_merged = pd.get_dummies(eng_merged, columns=nominal,
> >                                        dummy_na=False)
> >         onehot_train = onehot_merged[:rows]
> >         onehot_test = onehot_merged[rows:]
> >
> > - class RandomSelectionHelper
> >   - [ ] this could be generally helpful in sklearn[-pandas]
> >     - https://github.com/paulgb/sklearn-pandas#cross-validation
> >
> > - Models to Search
> >   - {Ridge, Lasso, ElasticNet}
> >
> >      - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L222
> >        _get_estimator_names("regressor")
> >        - {XGBRegressor, GradientBoostingRegressor, RANSACRegressor,
> >          RandomForestRegressor, LinearRegression, AdaBoostRegressor,
> >          ExtraTreesRegressor}
> >
> >      - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L491
> >        - (w/ ensembling)
> >        -  ['RandomForestRegressor', 'LinearRegression',
> > 'ExtraTreesRegressor', 'Ridge', 'GradientBoostingRegressor',
> > 'AdaBoostRegressor', 'Lasso', 'ElasticNet', 'LassoLars',
> > 'OrthogonalMatchingPursuit', 'BayesianRidge', 'SGDRegressor'] +
> > ['XGBRegressor']
> >
> > - model stacking / ensembling
> >
> >   - ( ) auto_ml: https://auto-ml.readthedocs.io/en/latest/ensembling.html
> >   - ( ) auto-sklearn:
> >         https://automl.github.io/auto-sklearn/stable/api.html#autosklearn.regression.AutoSklearnRegressor
> >         ensemble_size=50, ensemble_nbest=50
> >
>
> https://en.wikipedia.org/wiki/Ensemble_learning
>
> http://www.scholarpedia.org/article/Ensemble_learning#Ensemble_combination_rules
>
>
> >
> > - submission['SalePrice'] = submission.SalePrice.apply(lambda x:
> > np.exp(x))
> >
> >   - [ ] What is this called / how does this work?
> >     - https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
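To answer the question inline: it is just inverting the log transform applied to SalePrice before training. np.exp is already vectorized over a Series, so the .apply(lambda ...) wrapper is unnecessary. A tiny illustration with made-up log-scale predictions:

```python
import numpy as np
import pandas as pd

# Model predictions on the log scale (invented values)
log_preds = pd.Series([11.0, 12.0])

# np.exp undoes np.log elementwise; no .apply needed
sale_price = np.exp(log_preds)
```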
> >
> > - df.to_csv(filename, columns=['SalePrice'], index_label='Id') also works
> >   - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
> >
> >
> >
> >> My notebook is on GitHub for those interested:
> >>
> >> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4
> >
> >
> > Thanks!
> >
>
> (Trimmed for 40K limit)
> _______________________________________________
> Omaha Python Users Group mailing list
> Omaha at python.org
> https://mail.python.org/mailman/listinfo/omaha
> http://www.OmahaPython.org
>
