[omaha] Group Data Science Competition

Wes Turner wes.turner at gmail.com
Wed Jan 4 00:50:09 EST 2017


... https://en.wikipedia.org/wiki/Regression_(psychology)

On Tue, Jan 3, 2017 at 11:49 PM, Wes Turner <wes.turner at gmail.com> wrote:

> https://docs.scipy.org/doc/numpy/reference/routines.random.html#distributions
>
> ( I haven't looked. )
>
> On Tue, Jan 3, 2017 at 10:41 PM, Jeremy Doyle via Omaha <omaha at python.org>
> wrote:
>
>> Looks like we have our key to a score of 0.0. Lol
>>
>> Seriously though, does anyone wonder if the person sitting at #1 had this
>> full data set as well and trained a model using the entire set? I mean that
>> 0.038 score is so much better than anyone else's that it seems a little
>> unrealistic...or maybe it just seems that way because I haven't been able
>> to break through 0.12   : )
>>
>>
>>
>>
>>
>> Sent from my iPhone
>> > On Jan 3, 2017, at 7:51 PM, Bob Haffner via Omaha <omaha at python.org>
>> wrote:
>> >
>> > Pretty interesting notebook I put together regarding the kaggle comp
>> > https://github.com/bobhaffner/kaggle-houseprices/blob/master/additional_training_data.ipynb
>> >
>> > On Mon, Jan 2, 2017 at 12:10 AM, Wes Turner via Omaha <omaha at python.org>
>> > wrote:
>> >
>> >>> On Wednesday, December 28, 2016, Wes Turner <wes.turner at gmail.com>
>> wrote:
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha <
>> >> omaha at python.org> wrote:
>> >>>
>> >>>> Woohoo! We jumped 286 positions with a meager 0.00448 improvement in
>> >>>> our score! Currently sitting at 798th place.
>> >>>>
>> >>>
>> >>> Nice work! Features of your feature engineering I admire:
>> >>>
>> >>> - nominal, ordinal, continuous, discrete
>> >>>  categorical = nominal + discrete
>> >>>  numeric = continuous + discrete
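For concreteness, that categorical/numeric split can be derived from column dtypes. A minimal sketch with a toy frame (illustrative Ames-style column names, not the notebook's actual code):

```python
import pandas as pd

# Toy frame standing in for the Ames training data
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "OldTown"],   # nominal (string)
    "OverallQual": [5, 7],                  # discrete (integer-coded)
    "GrLivArea": [856.0, 1262.0],           # continuous
})

# object dtype -> categorical; numeric dtype -> numeric
categorical = df.select_dtypes(include="object").columns.tolist()
numeric = df.select_dtypes(include="number").columns.tolist()
```

Discrete integer-coded columns like OverallQual land on the numeric side here, so a dtype split alone doesn't recover the nominal + discrete grouping above; those columns still have to be reassigned by hand.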
>> >>>
>> >>> - outlier removal
>> >>>  - [ ] w/ constant thresholding? (or is there a distribution parameter?)
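On the thresholding question: a constant cutoff and a distribution-derived cutoff (Tukey's 1.5 * IQR fence) can be compared like this. Toy numbers; the 4000 cutoff is illustrative, not the notebook's actual value:

```python
import pandas as pd

df = pd.DataFrame({"GrLivArea": [856, 1262, 1500, 1710, 8000]})

# Constant threshold (hand-picked cutoff):
kept_const = df[df["GrLivArea"] < 4000]

# Distribution-derived threshold: Tukey's 1.5 * IQR upper fence
q1, q3 = df["GrLivArea"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
kept_iqr = df[df["GrLivArea"] <= upper]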
>> >>>
>> >>> - building datestrings from SaleMonth and YrSold
>> >>>  - SaleMonth / "1" / YrSold
>> >>>   - df.drop(['MoSold','YrSold','SaleMonth'], axis=1)
>> >>>     - [ ] why drop SaleMonth?
>> >>>  - [ ] pandas.to_datetime(df['SaleMonth'])
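The datestring step can be sketched like so (toy rows; MoSold/YrSold as in the competition data):

```python
import pandas as pd

df = pd.DataFrame({"MoSold": [2, 7], "YrSold": [2008, 2009]})

# Build "month/1/year" strings, then parse them into a datetime column
df["SaleMonth"] = df["MoSold"].astype(str) + "/1/" + df["YrSold"].astype(str)
df["SaleDate"] = pd.to_datetime(df["SaleMonth"])
```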
>> >>>
>> >>> - merging with the FHFA House Price Index for the month and region
>> >>> ("West North Central")
>> >>>  https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_PO_monthly_hist.xls
>> >>>  - [ ] pandas.to_datetime
>> >>>    - this should have every month, but the new merge_asof feature is
>> >>> worth mentioning
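A minimal merge_asof sketch (pandas >= 0.19; the HPI numbers here are made up, standing in for the FHFA spreadsheet):

```python
import pandas as pd

sales = pd.DataFrame({
    "SaleDate": pd.to_datetime(["2008-02-15", "2009-07-01"]),
    "SalePrice": [140000, 175000],
}).sort_values("SaleDate")

# Hypothetical monthly index series; both frames must be sorted on the key
hpi = pd.DataFrame({
    "Month": pd.to_datetime(["2008-01-01", "2008-02-01", "2009-07-01"]),
    "HPI": [210.1, 209.5, 198.7],
})

# merge_asof matches each sale to the most recent index value
# at or before the sale date -- no exact key match required
merged = pd.merge_asof(sales, hpi, left_on="SaleDate", right_on="Month")
```

With every month present, an ordinary merge on the month would do; merge_asof only matters if some months were missing from the index.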
>> >>>
>> >>> - manual binarization
>> >>>  - [ ] how did you pick these? correlation after pd.get_dummies?
>> >>>  - [ ] why floats? 1.0 / 1 (does it make a difference?)
>> >>>
>> >>> - Ames, IA nbrhood_multiplier
>> >>>  - http://www.cityofames.org/home/showdocument?id=1024
>> >>>
>> >>> - feature merging
>> >>>  - BsmtFinSF = BsmtFinSF1 + BsmtFinSF2
>> >>>  - TotalBaths = BsmtFullBath + (BsmtHalfBath / 2.0) + FullBath +
>> >>> (HalfBath / 2.0)
>> >>>  - ( ) IDK how a feature-selection pipeline could do this automatically
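The two merges above, as runnable pandas (single toy row; column names from the competition data):

```python
import pandas as pd

df = pd.DataFrame({
    "BsmtFinSF1": [400], "BsmtFinSF2": [100],
    "BsmtFullBath": [1], "BsmtHalfBath": [1],
    "FullBath": [2], "HalfBath": [1],
})

# Combine related columns into single features; half baths count as 0.5
df["BsmtFinSF"] = df["BsmtFinSF1"] + df["BsmtFinSF2"]
df["TotalBaths"] = (df["BsmtFullBath"] + df["BsmtHalfBath"] / 2.0
                    + df["FullBath"] + df["HalfBath"] / 2.0)
```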
>> >>>
>> >>> - null value imputation
>> >>>  - .isnull() = 0
>> >>>  - ( ) datacleaner incorrectly sets these to median or mode
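The difference between the two imputation strategies in miniature (toy column; GarageArea is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"GarageArea": [480.0, None, 576.0]})

# Treat missing as "feature absent" (no garage -> 0 square feet):
df["GarageArea_zero"] = df["GarageArea"].fillna(0)

# What a generic cleaner would do instead -- impute the median,
# which invents a typical-sized garage for a house that has none:
df["GarageArea_median"] = df["GarageArea"].fillna(df["GarageArea"].median())
```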
>> >>>
>> >>> - log for skewed continuous and SalePrice
>> >>>  - ( ) auto_ml: take_log_of_y does this for SalePrice
>> >>>
>> >>> - "Keeping only the columns we want"
>> >>>  - [ ] 'Id' shouldn't be relevant (pd.read_csv(filename,
>> index_col='Id'))
>> >>>
>> >>>
>> >>> - Binarization
>> >>>  - pd.get_dummies(dummy_na=False)
>> >>>  - [ ] (as Luke pointed out, concatenation keeps the same columns)
>> >>>        rows = eng_train.shape[0]
>> >>>        eng_merged = pd.concat([eng_train, eng_test])
>> >>>        onehot_merged = pd.get_dummies(eng_merged, columns=nominal,
>> >>>                                       dummy_na=False)
>> >>>        onehot_train = onehot_merged[:rows]
>> >>>        onehot_test = onehot_merged[rows:]
>> >>>
>> >>> - class RandomSelectionHelper
>> >>>  - [ ] this could be generally helpful in sklearn[-pandas]
>> >>>    - https://github.com/paulgb/sklearn-pandas#cross-validation
>> >>>
>> >>> - Models to Search
>> >>>  - {Ridge, Lasso, ElasticNet}
>> >>>
>> >>>     - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L222
>> >>>       _get_estimator_names("regressor")
>> >>>       - {XGBRegressor, GradientBoostingRegressor, RANSACRegressor,
>> >>> RandomForestRegressor, LinearRegression, AdaBoostRegressor,
>> >>> ExtraTreesRegressor}
>> >>>
>> >>>     - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L491
>> >>>       - (w/ ensembling)
>> >>>       -  ['RandomForestRegressor', 'LinearRegression',
>> >>> 'ExtraTreesRegressor', 'Ridge', 'GradientBoostingRegressor',
>> >>> 'AdaBoostRegressor', 'Lasso', 'ElasticNet', 'LassoLars',
>> >>> 'OrthogonalMatchingPursuit', 'BayesianRidge', 'SGDRegressor'] +
>> >>> ['XGBRegressor']
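A minimal model-search loop over a few of those regressors with scikit-learn's cross_val_score (toy data; this is a sketch, not the RandomSelectionHelper implementation):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

# Toy regression problem: linear signal plus noise
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(0, 0.1, 60)

# Score each candidate model by 5-fold cross-validated MSE
scores = {}
for model in (Ridge(), Lasso(alpha=0.01), ElasticNet(alpha=0.01)):
    name = type(model).__name__
    scores[name] = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error").mean()

best = max(scores, key=scores.get)  # highest (least negative) score wins
```

A real search would also sweep each model's hyperparameters (e.g. with GridSearchCV) rather than fixing alpha.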
>> >>>
>> >>> - model stacking / ensembling
>> >>>
>> >>>  - ( ) auto_ml: https://auto-ml.readthedocs.io/en/latest/ensembling.html
>> >>>  - ( ) auto-sklearn:
>> >>>        https://automl.github.io/auto-sklearn/stable/api.html#autosklearn.regression.AutoSklearnRegressor
>> >>>        ensemble_size=50, ensemble_nbest=50
>> >>>
>> >>
>> >> https://en.wikipedia.org/wiki/Ensemble_learning
>> >>
>> >> http://www.scholarpedia.org/article/Ensemble_learning#Ensemble_combination_rules
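The simplest of those combination rules is averaging the per-model predictions (made-up prediction values, purely for illustration):

```python
import numpy as np

# Predictions from three hypothetical trained models for two houses
preds_ridge = np.array([120000.0, 180000.0])
preds_gbm = np.array([118000.0, 184000.0])
preds_xgb = np.array([122000.0, 182000.0])

# Averaging ensemble: mean prediction across models, per house
ensemble = np.mean([preds_ridge, preds_gbm, preds_xgb], axis=0)
```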
>> >>
>> >>
>> >>>
>> >>> - submission['SalePrice'] = submission.SalePrice.apply(lambda x: np.exp(x))
>> >>>
>> >>>  - [ ] What is this called / how does this work?
>> >>>    - https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
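On the question above: it's the inverse transform. Since the model was trained on log(SalePrice), applying np.exp to the predictions maps them back to dollar amounts (exponentiation undoes the natural log). A minimal round-trip check:

```python
import numpy as np

price = 200000.0
log_price = np.log(price)      # transform applied to the target before training
recovered = np.exp(log_price)  # inverse transform applied to predictions
```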
>> >>>
>> >>> - df.to_csv(filename, columns=['SalePrice'], index_label='Id') also works
>> >>>  - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
>> >>>
>> >>>
>> >>>
>> >>>> My notebook is on GitHub for those interested:
>> >>>>
>> >>>> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4
>> >>>
>> >>>
>> >>> Thanks!
>> >>>
>> >>
>> >> (Trimmed for 40K limit)
>> >> _______________________________________________
>> >> Omaha Python Users Group mailing list
>> >> Omaha at python.org
>> >> https://mail.python.org/mailman/listinfo/omaha
>> >> http://www.OmahaPython.org
>> >>
>>
>>
>
>

