[omaha] Group Data Science Competition
Jeremy Doyle
uiab1638 at yahoo.com
Tue Jan 3 23:41:17 EST 2017
Looks like we have our key to a score of 0.0. Lol
Seriously though, does anyone wonder if the person sitting at #1 had this full data set as well and trained a model using the entire set? I mean, that 0.038 score is so much better than everyone else's that it seems a little unrealistic... or maybe it just seems that way because I haven't been able to break through 0.12 : )
Sent from my iPhone
> On Jan 3, 2017, at 7:51 PM, Bob Haffner via Omaha <omaha at python.org> wrote:
>
> Pretty interesting notebook I put together regarding the kaggle comp
> https://github.com/bobhaffner/kaggle-houseprices/blob/master/additional_training_data.ipynb
>
> On Mon, Jan 2, 2017 at 12:10 AM, Wes Turner via Omaha <omaha at python.org>
> wrote:
>
>>> On Wednesday, December 28, 2016, Wes Turner <wes.turner at gmail.com> wrote:
>>>
>>>
>>>
>>> On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha <omaha at python.org> wrote:
>>>
>>>> Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our
>>>> score! Currently sitting at 798th place.
>>>>
>>>
>>> Nice work! Features of your feature engineering I admire:
>>>
>>> - nominal, ordinal, continuous, discrete
>>> categorical = nominal + discrete
>>> numeric = continuous + discrete
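>>> - e.g. a sketch of that split (these column lists are illustrative,
>>> not yours):
>>> ordinal = ['OverallQual', 'OverallCond']
>>> nominal = ['Neighborhood', 'MSZoning', 'SaleCondition']
>>> continuous = ['LotArea', 'GrLivArea', '1stFlrSF']
>>> discrete = ['MoSold', 'YrSold', 'TotRmsAbvGrd']
>>> categorical = nominal + discrete
>>> numeric = continuous + discrete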
>>>
>>> - outlier removal
>>> - [ ] w/ constant thresholding? (is there a distribution parameter?)
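>>> - e.g. a quantile threshold instead of a constant one (a sketch,
>>> assuming a train DataFrame):
>>> # drop rows with GrLivArea beyond the 99.5th percentile
>>> cutoff = train['GrLivArea'].quantile(0.995)
>>> train = train[train['GrLivArea'] <= cutoff]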
>>>
>>> - building datestrings from SaleMonth and YrSold
>>> - SaleMonth / "1" / YrSold
>>> - df.drop(['MoSold','YrSold','SaleMonth'])
>>> - [ ] why drop SaleMonth?
>>> - [ ] pandas.to_datetime(df['SaleMonth'])
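>>> - i.e. something like (a sketch of what I think that step does;
>>> 'SaleDate' is a made-up name):
>>> # "6" + "/1/" + "2008" -> Timestamp('2008-06-01')
>>> df['SaleDate'] = pd.to_datetime(
>>>     df['MoSold'].astype(str) + '/1/' + df['YrSold'].astype(str),
>>>     format='%m/%d/%Y')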
>>>
>>> - merging with FHA Home Price Index for the month and region ("West North
>>> Central")
>>> https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_PO_monthly_hist.xls
>>> - [ ] pandas.to_datetime
>>> - this should have every month, but the new merge_asof feature is
>>> worth mentioning
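>>> - merge_asof would look something like this (a sketch; the hpi frame
>>> and its column names are hypothetical, and both frames must be
>>> sorted on the date keys):
>>> hpi = hpi.sort_values('date')  # one row per month: date, hpi_wnc
>>> df = df.sort_values('SaleDate')
>>> df = pd.merge_asof(df, hpi, left_on='SaleDate', right_on='date')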
>>>
>>> - manual binarization
>>> - [ ] how did you pick these? correlation after pd.get_dummies?
>>> - [ ] why floats? 1.0 / 1 (does it make a difference?)
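>>> - e.g. (a sketch; 'HasCentralAir' is a made-up flag, and the dtype
>>> shouldn't matter since sklearn casts to float64 internally):
>>> df['HasCentralAir'] = (df['CentralAir'] == 'Y').astype(float)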
>>>
>>> - Ames, IA nbrhood_multiplier
>>> - http://www.cityofames.org/home/showdocument?id=1024
>>>
>>> - feature merging
>>> - BsmtFinSF = BsmtFinSF1 + BsmtFinSF2
>>> - TotalBaths = BsmtFullBath + (BsmtHalfBath / 2.0) + FullBath +
>>> (HalfBath / 2.0)
>>> - ( ) IDK how a feature-selection pipeline could do this automatically
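>>> - one option: wrap the sums in a FunctionTransformer so a pipeline
>>> can at least apply them as a step (a sketch):
>>> from sklearn.preprocessing import FunctionTransformer
>>>
>>> def add_merged_features(df):
>>>     df = df.copy()
>>>     df['BsmtFinSF'] = df['BsmtFinSF1'] + df['BsmtFinSF2']
>>>     df['TotalBaths'] = (df['BsmtFullBath'] + df['BsmtHalfBath'] / 2.0
>>>                         + df['FullBath'] + df['HalfBath'] / 2.0)
>>>     return df
>>>
>>> merge_step = FunctionTransformer(add_merged_features, validate=False)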
>>>
>>> - null value imputation
>>> - .isnull() = 0
>>> - ( ) datacleaner incorrectly sets these to median or mode
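>>> - i.e. fill with 0 where NaN means "feature not present" rather than
>>> "measurement missing" (a sketch; column list illustrative):
>>> for col in ['BsmtFinSF1', 'BsmtFinSF2', 'GarageArea']:
>>>     df[col] = df[col].fillna(0)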
>>>
>>> - log for skewed continuous and SalePrice
>>> - ( ) auto_ml: take_log_of_y does this for SalePrice
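>>> - e.g. (a sketch, reusing a numeric column list like the one above;
>>> log1p also handles zeros):
>>> import numpy as np
>>> from scipy.stats import skew
>>> y = np.log1p(train['SalePrice'])
>>> skewed = [c for c in numeric if skew(train[c].dropna()) > 0.75]
>>> train[skewed] = np.log1p(train[skewed])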
>>>
>>> - "Keeping only the columns we want"
>>> - [ ] 'Id' shouldn't be relevant (pd.read_csv(filename, index_col='Id'))
>>>
>>>
>>> - Binarization
>>> - pd.get_dummies(dummy_na=False)
>>> - [ ] (as Luke pointed out, concatenation keeps the same columns)
>>> rows = eng_train.shape[0]
>>> eng_merged = pd.concat([eng_train, eng_test])
>>> onehot_merged = pd.get_dummies(eng_merged, columns=nominal,
>>>                                dummy_na=False)
>>> onehot_train = onehot_merged[:rows]
>>> onehot_test = onehot_merged[rows:]
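>>> - (concatenating first guarantees train and test end up with
>>> identical dummy columns, even when a category appears in only one
>>> of them)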
>>>
>>> - class RandomSelectionHelper
>>> - [ ] this could be generally helpful in sklearn[-pandas]
>>> - https://github.com/paulgb/sklearn-pandas#cross-validation
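>>> - a minimal version of that pattern (a sketch; names hypothetical):
>>> from sklearn.model_selection import GridSearchCV
>>>
>>> def search_all(models, params, X, y, cv=5):
>>>     # one GridSearchCV per candidate model, keyed by name
>>>     searches = {}
>>>     for name, model in models.items():
>>>         gs = GridSearchCV(model, params[name], cv=cv,
>>>                           scoring='neg_mean_squared_error')
>>>         searches[name] = gs.fit(X, y)
>>>     return searches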
>>>
>>> - Models to Search
>>> - {Ridge, Lasso, ElasticNet}
>>>
>>> - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L222
>>> _get_estimator_names("regressor")
>>> - {XGBRegressor, GradientBoostingRegressor, RANSACRegressor,
>>> RandomForestRegressor, LinearRegression, AdaBoostRegressor,
>>> ExtraTreesRegressor}
>>>
>>> - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L491
>>> - (w/ ensembling)
>>> - ['RandomForestRegressor', 'LinearRegression',
>>> 'ExtraTreesRegressor', 'Ridge', 'GradientBoostingRegressor',
>>> 'AdaBoostRegressor', 'Lasso', 'ElasticNet', 'LassoLars',
>>> 'OrthogonalMatchingPursuit', 'BayesianRidge', 'SGDRegressor'] + ['XGBRegressor']
>>>
>>> - model stacking / ensembling
>>>
>>> - ( ) auto_ml: https://auto-ml.readthedocs.io/en/latest/ensembling.html
>>> - ( ) auto-sklearn:
>>> https://automl.github.io/auto-sklearn/stable/api.html#autosklearn.regression.AutoSklearnRegressor
>>> ensemble_size=50, ensemble_nbest=50
>>>
>>
>> https://en.wikipedia.org/wiki/Ensemble_learning
>>
>> http://www.scholarpedia.org/article/Ensemble_learning#Ensemble_combination_rules
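>>
>> A combination rule can be as simple as averaging per-model predictions
>> (a sketch; assumes already-fitted regressors with made-up names):
>>
>> import numpy as np
>> preds = [m.predict(X_test) for m in (ridge, lasso, elastic)]
>> blended = np.mean(preds, axis=0)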
>>
>>
>>>
>>> - submission['SalePrice'] = submission.SalePrice.apply(lambda x: np.exp(x))
>>>
>>> - [ ] What is this called / how does this work?
>>> - https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
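>>> - it's the inverse of the earlier log transform on SalePrice (a
>>> back-transformation); np.exp(submission['SalePrice']) would do the
>>> same thing vectorized, and if log1p had been used the inverse would
>>> be np.expm1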
>>>
>>> - df.to_csv(filename, columns=['SalePrice'], index_label='Id') also works
>>> - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
>>>
>>>
>>>
>>>> My notebook is on GitHub for those interested:
>>>>
>>>> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4
>>>
>>>
>>> Thanks!
>>>
>>
>> (Trimmed for 40K limit)