[omaha] Group Data Science Competition
Jeremy Doyle
uiab1638 at yahoo.com
Tue Jan 3 23:41:17 EST 2017
Looks like we have our key to a score of 0.0. Lol
Seriously though, does anyone wonder if the person sitting at #1 had this full data set as well and trained a model using the entire set? I mean, that 0.038 score is so much better than everyone else's that it seems a little unrealistic... or maybe it just seems that way because I haven't been able to break through 0.12 : )
Sent from my iPhone
> On Jan 3, 2017, at 7:51 PM, Bob Haffner via Omaha <omaha at python.org> wrote:
>
> Pretty interesting notebook I put together regarding the kaggle comp
> https://github.com/bobhaffner/kaggle-houseprices/blob/master/additional_training_data.ipynb
>
> On Mon, Jan 2, 2017 at 12:10 AM, Wes Turner via Omaha <omaha at python.org>
> wrote:
>
>>> On Wednesday, December 28, 2016, Wes Turner <wes.turner at gmail.com> wrote:
>>>
>>>
>>>
>>> On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha <omaha at python.org> wrote:
>>>
>>>> Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our
>>>> score! Currently sitting at 798th place.
>>>>
>>>
>>> Nice work! Features of your feature engineering I admire:
>>>
>>> - nominal, ordinal, continuous, discrete
>>> categorical = nominal + discrete
>>> numeric = continuous + discrete
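>>> - e.g. a sketch of that split (these column lists are illustrative,
>>> not yours):
>>> ordinal = ['OverallQual', 'OverallCond']
>>> nominal = ['Neighborhood', 'MSZoning', 'SaleCondition']
>>> continuous = ['LotArea', 'GrLivArea', '1stFlrSF']
>>> discrete = ['MoSold', 'YrSold', 'TotRmsAbvGrd']
>>> categorical = nominal + discrete
>>> numeric = continuous + discrete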
>>>
>>> - outlier removal
>>> - [ ] w/ constant thresholding? (is there a distribution parameter?)
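>>> - e.g. a quantile threshold instead of a constant one (a sketch,
>>> assuming a train DataFrame):
>>> # drop rows with GrLivArea beyond the 99.5th percentile
>>> cutoff = train['GrLivArea'].quantile(0.995)
>>> train = train[train['GrLivArea'] <= cutoff]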
>>>
>>> - building datestrings from SaleMonth and YrSold
>>> - SaleMonth / "1" / YrSold
>>> - df.drop(['MoSold','YrSold','SaleMonth'])
>>> - [ ] why drop SaleMonth?
>>> - [ ] pandas.to_datetime(df['SaleMonth'])
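>>> - i.e. something like (a sketch of what I think that step does;
>>> 'SaleDate' is a made-up name):
>>> # "6" + "/1/" + "2008" -> Timestamp('2008-06-01')
>>> df['SaleDate'] = pd.to_datetime(
>>>     df['MoSold'].astype(str) + '/1/' + df['YrSold'].astype(str),
>>>     format='%m/%d/%Y')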
>>>
>>> - merging with FHA Home Price Index for the month and region ("West North
>>> Central")
>>> https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_PO_monthly_hist.xls
>>> - [ ] pandas.to_datetime
>>> - this should have every month, but the new merge_asof feature is
>>> worth mentioning
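>>> - merge_asof would look something like this (a sketch; the hpi frame
>>> and its column names are hypothetical, and both frames must be
>>> sorted on the date keys):
>>> hpi = hpi.sort_values('date')  # one row per month: date, hpi_wnc
>>> df = df.sort_values('SaleDate')
>>> df = pd.merge_asof(df, hpi, left_on='SaleDate', right_on='date')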
>>>
>>> - manual binarization
>>> - [ ] how did you pick these? correlation after pd.get_dummies?
>>> - [ ] why floats? 1.0 / 1 (does it make a difference?)
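>>> - e.g. (a sketch; 'HasCentralAir' is a made-up flag, and the dtype
>>> shouldn't matter since sklearn casts to float64 internally):
>>> df['HasCentralAir'] = (df['CentralAir'] == 'Y').astype(float)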
>>>
>>> - Ames, IA nbrhood_multiplier
>>> - http://www.cityofames.org/home/showdocument?id=1024
>>>
>>> - feature merging
>>> - BsmtFinSF = BsmtFinSF1 + BsmtFinSF2
>>> - TotalBaths = BsmtFullBath + (BsmtHalfBath / 2.0) + FullBath +
>>> (HalfBath / 2.0)
>>> - ( ) IDK how a feature-selection pipeline could do this automatically
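>>> - one option: wrap the sums in a FunctionTransformer so a pipeline
>>> can at least apply them as a step (a sketch):
>>> from sklearn.preprocessing import FunctionTransformer
>>>
>>> def add_merged_features(df):
>>>     df = df.copy()
>>>     df['BsmtFinSF'] = df['BsmtFinSF1'] + df['BsmtFinSF2']
>>>     df['TotalBaths'] = (df['BsmtFullBath'] + df['BsmtHalfBath'] / 2.0
>>>                         + df['FullBath'] + df['HalfBath'] / 2.0)
>>>     return df
>>>
>>> merge_step = FunctionTransformer(add_merged_features, validate=False)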
>>>
>>> - null value imputation
>>> - .isnull() = 0
>>> - ( ) datacleaner incorrectly sets these to median or mode
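>>> - i.e. fill with 0 where NaN means "feature not present" rather than
>>> "measurement missing" (a sketch; column list illustrative):
>>> for col in ['BsmtFinSF1', 'BsmtFinSF2', 'GarageArea']:
>>>     df[col] = df[col].fillna(0)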
>>>
>>> - log for skewed continuous and SalePrice
>>> - ( ) auto_ml: take_log_of_y does this for SalePrice
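>>> - e.g. (a sketch, reusing a numeric column list like the one above;
>>> log1p also handles zeros):
>>> import numpy as np
>>> from scipy.stats import skew
>>> y = np.log1p(train['SalePrice'])
>>> skewed = [c for c in numeric if skew(train[c].dropna()) > 0.75]
>>> train[skewed] = np.log1p(train[skewed])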
>>>
>>> - "Keeping only the columns we want"
>>> - [ ] 'Id' shouldn't be relevant (pd.read_csv(filename, index_col='Id'))
>>>
>>>
>>> - Binarization
>>> - pd.get_dummies(dummy_na=False)
>>> - [ ] (as Luke pointed out, concatenation keeps the same columns)
>>> rows = eng_train.shape[0]
>>> eng_merged = pd.concat([eng_train, eng_test])
>>> onehot_merged = pd.get_dummies(eng_merged, columns=nominal,
>>>                                dummy_na=False)
>>> onehot_train = onehot_merged[:rows]
>>> onehot_test = onehot_merged[rows:]
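>>> - (concatenating first guarantees train and test end up with
>>> identical dummy columns, even when a category appears in only one
>>> of them)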
>>>
>>> - class RandomSelectionHelper
>>> - [ ] this could be generally helpful in sklearn[-pandas]
>>> - https://github.com/paulgb/sklearn-pandas#cross-validation
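>>> - a minimal version of that pattern (a sketch; names hypothetical):
>>> from sklearn.model_selection import GridSearchCV
>>>
>>> def search_all(models, params, X, y, cv=5):
>>>     # one GridSearchCV per candidate model, keyed by name
>>>     searches = {}
>>>     for name, model in models.items():
>>>         gs = GridSearchCV(model, params[name], cv=cv,
>>>                           scoring='neg_mean_squared_error')
>>>         searches[name] = gs.fit(X, y)
>>>     return searches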
>>>
>>> - Models to Search
>>> - {Ridge, Lasso, ElasticNet}
>>>
>>> - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L222
>>> _get_estimator_names("regressor")
>>> - {XGBRegressor, GradientBoostingRegressor, RANSACRegressor,
>>> RandomForestRegressor, LinearRegression, AdaBoostRegressor,
>>> ExtraTreesRegressor}
>>>
>>> - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L491
>>> - (w/ ensembling)
>>> - ['RandomForestRegressor', 'LinearRegression',
>>> 'ExtraTreesRegressor', 'Ridge', 'GradientBoostingRegressor',
>>> 'AdaBoostRegressor', 'Lasso', 'ElasticNet', 'LassoLars',
>>> 'OrthogonalMatchingPursuit', 'BayesianRidge', 'SGDRegressor'] + ['XGBRegressor']
>>>
>>> - model stacking / ensembling
>>>
>>> - ( ) auto_ml: https://auto-ml.readthedocs.io/en/latest/ensembling.html
>>> - ( ) auto-sklearn:
>>> https://automl.github.io/auto-sklearn/stable/api.html#autosklearn.regression.AutoSklearnRegressor
>>> ensemble_size=50, ensemble_nbest=50
>>>
>>
>> https://en.wikipedia.org/wiki/Ensemble_learning
>>
>> http://www.scholarpedia.org/article/Ensemble_learning#Ensemble_combination_rules
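>>
>> A combination rule can be as simple as averaging per-model predictions
>> (a sketch; assumes already-fitted regressors with made-up names):
>>
>> import numpy as np
>> preds = [m.predict(X_test) for m in (ridge, lasso, elastic)]
>> blended = np.mean(preds, axis=0)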
>>
>>
>>>
>>> - submission['SalePrice'] = submission.SalePrice.apply(lambda x: np.exp(x))
>>>
>>> - [ ] What is this called / how does this work?
>>> - https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
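>>> - it's the inverse of the earlier log transform on SalePrice (a
>>> back-transformation); np.exp(submission['SalePrice']) would do the
>>> same thing vectorized, and if log1p had been used the inverse would
>>> be np.expm1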
>>>
>>> - df.to_csv(filename, columns=['SalePrice'], index_label='Id') also works
>>> - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
>>>
>>>
>>>
>>>> My notebook is on GitHub for those interested:
>>>>
>>>> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4
>>>
>>>
>>> Thanks!
>>>
>>
>> (Trimmed for 40K limit)