[omaha] Group Data Science Competition

Bob Haffner bob.haffner at gmail.com
Wed Jan 4 08:43:43 EST 2017


Yeah, no kidding.  That PDF wasn't hard to find, and that #1 score is pretty
damn good.

On Tue, Jan 3, 2017 at 10:41 PM, Jeremy Doyle via Omaha <omaha at python.org>
wrote:

> Looks like we have our key to a score of 0.0. Lol
>
> Seriously though, does anyone wonder if the person sitting at #1 had this
> full data set as well and trained a model using the entire set? I mean that
> 0.038 score is so much better than anyone else it seems a little
> unrealistic... or maybe it just seems that way because I haven't been able
> to break through 0.12   : )
>
> > On Jan 3, 2017, at 7:51 PM, Bob Haffner via Omaha <omaha at python.org> wrote:
> >
> > Pretty interesting notebook I put together regarding the Kaggle comp:
> > https://github.com/bobhaffner/kaggle-houseprices/blob/master/additional_training_data.ipynb
> >
> > On Mon, Jan 2, 2017 at 12:10 AM, Wes Turner via Omaha <omaha at python.org>
> > wrote:
> >
> >>> On Wednesday, December 28, 2016, Wes Turner <wes.turner at gmail.com> wrote:
> >>>
> >>> On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha <omaha at python.org> wrote:
> >>>
> >>>> Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our
> >>>> score! Currently sitting at 798th place.
> >>>>
> >>>
> >>> Nice work! Features of your feature engineering I admire:
> >>>
> >>> - nominal, ordinal, continuous, discrete
> >>>  categorical = nominal + ordinal
> >>>  numeric = continuous + discrete
> >>>
> >>> - outlier removal
> >>>  - [ ] w/ constant thresholding? (is there a distribution parameter)
> >>>
> >>> - building datestrings from SaleMonth and YrSold
> >>>  - SaleMonth / "1" / YrSold
> >>>   - df.drop(['MoSold','YrSold','SaleMonth'], axis=1)
> >>>     - [ ] why drop SaleMonth?
> >>>  - [ ] pandas.to_datetime(df['SaleMonth'])
> >>>
> >>> - merging with FHFA House Price Index for the month and region ("West
> >>> North Central")
> >>>  https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_PO_monthly_hist.xls
> >>>  - [ ] pandas.to_datetime
> >>>    - this should have every month, but the new merge_asof feature is
> >>>      worth mentioning
> >>>
> >>> - manual binarization
> >>>  - [ ] how did you pick these? correlation after pd.get_dummies?
> >>>  - [ ] why floats? 1.0 / 1 (does it make a difference?)
> >>>
> >>> - Ames, IA nbrhood_multiplier
> >>>  - http://www.cityofames.org/home/showdocument?id=1024
> >>>
> >>> - feature merging
> >>>  - BsmtFinSF = BsmtFinSF1 + BsmtFinSF2
> >>>  - TotalBaths = BsmtFullBath + (BsmtHalfBath / 2.0) + FullBath +
> >>> (HalfBath / 2.0)
> >>>  - ( ) IDK how a feature-selection pipeline could do this automatically
> >>>
> >>> - null value imputation
> >>>  - .isnull() = 0
> >>>  - ( ) datacleaner incorrectly sets these to median or mode
> >>>
> >>> - log for skewed continuous and SalePrice
> >>>  - ( ) auto_ml: take_log_of_y does this for SalePrice
> >>>
> >>> - "Keeping only the columns we want"
> >>>  - [ ] 'Id' shouldn't be relevant (pd.read_csv(filename, index_col='Id'))
> >>>
> >>>
> >>> - Binarization
> >>>  - pd.get_dummies(dummy_na=False)
> >>>  - [ ] (as Luke pointed out, concatenation keeps the same columns)
> >>>        rows = eng_train.shape[0]
> >>>        eng_merged = pd.concat([eng_train, eng_test])
> >>>        onehot_merged = pd.get_dummies(eng_merged, columns=nominal,
> >>>                                       dummy_na=False)
> >>>        onehot_train = onehot_merged[:rows]
> >>>        onehot_test = onehot_merged[rows:]
> >>>
> >>> - class RandomSelectionHelper
> >>>  - [ ] this could be generally helpful in sklearn[-pandas]
> >>>    - https://github.com/paulgb/sklearn-pandas#cross-validation
> >>>
> >>> - Models to Search
> >>>  - {Ridge, Lasso, ElasticNet}
> >>>
> >>>     - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L222
> >>>       _get_estimator_names ( "regressor" )
> >>>       - {XGBRegressor, GradientBoostingRegressor, RANSACRegressor,
> >>>         RandomForestRegressor, LinearRegression, AdaBoostRegressor,
> >>>         ExtraTreesRegressor}
> >>>
> >>>     - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L491
> >>>       - (w/ ensembling)
> >>>       - ['RandomForestRegressor', 'LinearRegression',
> >>>         'ExtraTreesRegressor', 'Ridge', 'GradientBoostingRegressor',
> >>>         'AdaBoostRegressor', 'Lasso', 'ElasticNet', 'LassoLars',
> >>>         'OrthogonalMatchingPursuit', 'BayesianRidge', 'SGDRegressor']
> >>>         + ['XGBRegressor']
> >>>
> >>> - model stacking / ensembling
> >>>
> >>>  - ( ) auto_ml: https://auto-ml.readthedocs.io/en/latest/ensembling.html
> >>>  - ( ) auto-sklearn:
> >>>        https://automl.github.io/auto-sklearn/stable/api.html#autosklearn.regression.AutoSklearnRegressor
> >>>        ensemble_size=50, ensemble_nbest=50
> >>>
> >>
> >> https://en.wikipedia.org/wiki/Ensemble_learning
> >>
> >> http://www.scholarpedia.org/article/Ensemble_learning#Ensemble_combination_rules
> >>
> >>
> >>>
> >>> - submission['SalePrice'] = submission.SalePrice.apply(lambda x: np.exp(x))
> >>>
> >>>  - [ ] What is this called / how does this work?
> >>>    - https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
> >>>
> >>> - df.to_csv(filename, columns=['SalePrice'], index_label='Id') also works
> >>>  - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
> >>>
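On the datestring and HPI items in the list above, here is a minimal sketch of building a first-of-month date from MoSold and YrSold and attaching an index value with pandas.merge_asof. Both small frames are made up for illustration, and the 'month' and 'hpi' column names are my assumption; the real HPI spreadsheet has its own layout, which isn't reproduced here.

```python
import pandas as pd

# Toy sales rows; MoSold and YrSold are the Kaggle column names,
# the values are invented for illustration.
sales = pd.DataFrame({
    "MoSold": [2, 7, 3],
    "YrSold": [2008, 2008, 2009],
})

# Build a first-of-month date string (YYYY-MM-01) and parse it.
sales["SaleDate"] = pd.to_datetime(
    sales["YrSold"].astype(str)
    + "-"
    + sales["MoSold"].astype(str).str.zfill(2)
    + "-01"
)

# Hypothetical monthly index table ('month' and 'hpi' are assumed names).
hpi = pd.DataFrame({
    "month": pd.to_datetime(["2008-01-01", "2008-06-01", "2009-01-01"]),
    "hpi": [100.0, 98.5, 95.0],
})

# merge_asof needs both frames sorted on their keys; by default it matches
# each sale to the most recent index row at or before the sale date.
merged = pd.merge_asof(
    sales.sort_values("SaleDate"),
    hpi.sort_values("month"),
    left_on="SaleDate",
    right_on="month",
)
```

If the index table really does have a row for every month, a plain pd.merge on the date column gives the same result; merge_asof only matters when some months are missing.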
> >>>
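On the Binarization item above, a small runnable sketch of the concatenate-then-encode idea. The two frames are toy data (Neighborhood and LotArea are real column names, the rows are invented) chosen so that train and test each lack a category the other has.

```python
import pandas as pd

# Toy frames: 'OldTown' only appears in train, 'Veenker' only in test.
eng_train = pd.DataFrame({"Neighborhood": ["NAmes", "OldTown"],
                          "LotArea": [8450, 9600]})
eng_test = pd.DataFrame({"Neighborhood": ["NAmes", "Veenker"],
                         "LotArea": [11250, 9550]})
nominal = ["Neighborhood"]

rows = eng_train.shape[0]

# Encode once on the stacked frame so both halves get identical dummy columns.
eng_merged = pd.concat([eng_train, eng_test], ignore_index=True)
onehot_merged = pd.get_dummies(eng_merged, columns=nominal, dummy_na=False)

# Split back by position.
onehot_train = onehot_merged.iloc[:rows]
onehot_test = onehot_merged.iloc[rows:]
```

Encoding the two frames separately would leave train without a Neighborhood_Veenker column and test without Neighborhood_OldTown, which is what trips up the model at predict time.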
> >>>
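Re the np.exp question above: as far as I can tell it is simply the inverse of the log transform applied to SalePrice before fitting, so predictions made on the log scale get mapped back to dollars. A small sketch with made-up prices and no actual model:

```python
import numpy as np

# Made-up sale prices in dollars.
prices = np.array([208500.0, 181500.0, 223500.0])

# Fit on the log scale: this compresses the long right tail of prices.
log_prices = np.log(prices)

# Stand-in for model output; a real model would predict on the log scale.
predicted_log = log_prices

# np.exp undoes np.log, returning the predictions to dollars.
predicted_prices = np.exp(predicted_log)
```

np.exp is vectorized, so np.exp(submission['SalePrice']) does the same job as the apply/lambda version.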
> >>>> My notebook is on GitHub for those interested:
> >>>>
> >>>> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4
> >>>
> >>>
> >>> Thanks!
> >>>
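One more on the feature merging and null imputation items in Wes's list: a short sketch of filling missing basement values with 0 and then building the combined features. The rows are invented, and which columns get zero-filled is my assumption, not something taken from the notebook.

```python
import pandas as pd

# Toy rows using the Kaggle column names; the second house has no basement data.
df = pd.DataFrame({
    "BsmtFinSF1": [706.0, None],
    "BsmtFinSF2": [0.0, None],
    "BsmtFullBath": [1.0, None],
    "BsmtHalfBath": [0.0, None],
    "FullBath": [2, 1],
    "HalfBath": [1, 0],
})

# Treat a missing basement value as "none", i.e. 0, rather than median or mode.
bsmt_cols = ["BsmtFinSF1", "BsmtFinSF2", "BsmtFullBath", "BsmtHalfBath"]
df[bsmt_cols] = df[bsmt_cols].fillna(0)

# Combined features as written in the list above.
df["BsmtFinSF"] = df["BsmtFinSF1"] + df["BsmtFinSF2"]
df["TotalBaths"] = (df["BsmtFullBath"] + df["BsmtHalfBath"] / 2.0
                    + df["FullBath"] + df["HalfBath"] / 2.0)
```

Filling before adding matters here: a NaN in any one term would otherwise make the whole sum NaN.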
> >>
> >> (Trimmed for 40K limit)
> >> _______________________________________________
> >> Omaha Python Users Group mailing list
> >> Omaha at python.org
> >> https://mail.python.org/mailman/listinfo/omaha
> >> http://www.OmahaPython.org
> >>
>
>

