[omaha] Group Data Science Competition

Bob Haffner bob.haffner at gmail.com
Sun Feb 12 00:16:38 EST 2017


Wes, I didn't check the MSE.  I need to, though, as my submission didn't
score well at all  :-)
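(For checking a submission locally before uploading, a quick metric helps. A minimal sketch, assuming — per the competition page — that scoring is RMSE on the log of SalePrice; the prices here are made up:)

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between log(actual) and log(predicted) sale prices."""
    return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

# Hypothetical held-out prices vs. model predictions
actual = np.array([200000.0, 150000.0, 320000.0])
predicted = np.array([195000.0, 160000.0, 300000.0])
score = rmsle(actual, predicted)
```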

I used TensorFlow as the backend.   Also, I used the KerasRegressor model,
which made things pretty simple.

On Fri, Feb 10, 2017 at 10:13 AM, Wes Turner <wes.turner at gmail.com> wrote:

> Anyone have a good way to re.find all the links in this thread?
>
> - [ ] linkgrep(thread) >> wiki
> - [ ] http://www.datatau.com is like news.ycombinator.com for Data Science
>
> @bob
> Cool.
> What is the (MSE, $ deviance) with Keras?
>
> Keras with Theano or TensorFlow?
>
> On Thursday, February 9, 2017, Bob Haffner via Omaha <omaha at python.org>
> wrote:
>
>> Added a Deep Learning section to my notebook
>> https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices.ipynb
>>
>> Using Keras for the modeling with TensorFlow as the backend.
>>
>> I've generated a submission, but I don't know how it performed as Kaggle
>> seems to be on the fritz tonight.
>>
>> On Sat, Jan 14, 2017 at 12:52 AM, Wes Turner <wes.turner at gmail.com>
>> wrote:
>>
>> >
>> >
>> > On Friday, January 13, 2017, Bob Haffner via Omaha <omaha at python.org>
>> > wrote:
>> >
>> >> Look at that.  Two teams have submitted perfect scores :-)
>> >>
>> >> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard
>> >
>> >
>> > https://www.kaggle.com/c/house-prices-advanced-regression-techniques/rules
>> >
>> >    - Due to the public nature of the data, this competition does not
>> >    count towards Kaggle ranking points.
>> >    - We ask that you respect the spirit of the competition and do not
>> >    cheat. Hand-labeling is forbidden.
>> >
>> >
>> > https://www.kaggle.com/wiki/ModelSubmissionBestPractices
>> >
>> > https://www.kaggle.com/wiki/WinningModelDocumentationTemplate (CNN,
>> > XGBoost)
>> >
>> > Hopefully I can find some time to fix the data loading function in my
>> > data.py and test w/ TPOT (manual sparse arrays), auto_ml,
>> >
>> > - https://www.coursera.org/learn/ml-foundations/lecture/2HrHv/learning-a-simple-regression-model-to-predict-house-prices-from-house-size (UW)
>> >
>> > - "Python Data Science Handbook": "This repository contains the entire Python
>> > Data Science Handbook <http://shop.oreilly.com/product/0636920034919.do>,
>> > in the form of (free!) Jupyter notebooks."
>> > https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/README.md#5-machine-learning (~UW)
>> >
>> > I'd also like to learn how to NN w/ tensors and Keras (Theano, TensorFlow)
>> > https://github.com/fchollet/keras
>> >
>> > - https://keras.io/getting-started/faq/#how-can-i-record-the-training-validation-loss-accuracy-at-each-epoch
>> >
>> > - http://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/
>> >
>> >
>> >> On Thu, Jan 5, 2017 at 11:20 AM, Bob Haffner <bob.haffner at gmail.com>
>> >> wrote:
>> >>
>> >> > Hi Travis,
>> >> >
>> >> >
>> >> >
>> >> > A few of us are doing the House Prices: Advanced Regression Techniques
>> >> > competition
>> >> >
>> >> > https://www.kaggle.com/c/house-prices-advanced-regression-techniques
>> >> >
>> >> >
>> >> >
>> >> > Our team is called Omaha Pythonistas.  You are more than welcome to join
>> >> > us!  Just let me know which email you use to sign up with on Kaggle and
>> >> > I’ll send out an invite.
>> >> >
>> >> >
>> >> >
>> >> > We met in December and we hope to meet again soon, most likely following
>> >> > our monthly meeting on 1/18.
>> >> >
>> >> >
>> >> >
>> >> > Some of our materials:
>> >> >
>> >> > https://github.com/omahapython/kaggle-houseprices
>> >> >
>> >> >
>> >> >
>> >> > https://github.com/jeremy-doyle/home_price_kaggle
>> >> >
>> >> >
>> >> >
>> >> > https://github.com/bobhaffner/kaggle-houseprices
>> >> >
>> >> > On Wed, Jan 4, 2017 at 8:50 AM, Travis Smith via Omaha <
>> >> omaha at python.org>
>> >> > wrote:
>> >> >
>> >> >> Hey, new guy here. What's the challenge, exactly?  I'm not a Kaggler yet,
>> >> >> but I have taken some data science courses.
>> >> >>
>> >> >> -Travis
>> >> >>
>> >> >> > On Jan 4, 2017, at 7:57, Luke Schollmeyer via Omaha <
>> >> omaha at python.org>
>> >> >> wrote:
>> >> >> >
>> >> >> > I think there are two probable things:
>> >> >> > 1. We're likely using some under-powered ML methods. Most of the Kaggle
>> >> >> > interviews of the top guys/teams I read are using some much more advanced
>> >> >> > methods to get their solutions into the top spots. I think what we're doing
>> >> >> > is fine for what we want to accomplish.
>> >> >> > 2. Feature engineering. Again, many of the interviews show that a ton of
>> >> >> > work goes into cleaning and conforming the data.
>> >> >> >
>> >> >> > I haven't backtracked any of the interviews to their submissions, so I
>> >> >> > don't know how often they tend to submit, like tweak a small aspect and
>> >> >> > keep honing that until it pays off.
>> >> >> >
>> >> >> > On Wed, Jan 4, 2017 at 7:43 AM, Bob Haffner via Omaha <
>> >> omaha at python.org
>> >> >> >
>> >> >> > wrote:
>> >> >> >
>> >> >> >> Yeah, no kidding.  That pdf wasn't hard to find and that #1 score is
>> >> >> >> pretty damn good.
>> >> >> >>
>> >> >> >> On Tue, Jan 3, 2017 at 10:41 PM, Jeremy Doyle via Omaha <
>> >> >> omaha at python.org>
>> >> >> >> wrote:
>> >> >> >>
>> >> >> >>> Looks like we have our key to a score of 0.0. Lol
>> >> >> >>>
>> >> >> >>> Seriously though, does anyone wonder if the person sitting at #1 had
>> >> >> >>> this full data set as well and trained a model using the entire set? I
>> >> >> >>> mean that 0.038 score is so much better than anyone else's it seems a
>> >> >> >>> little unrealistic...or maybe it just seems that way because I haven't
>> >> >> >>> been able to break through 0.12   : )
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> Sent from my iPhone
>> >> >> >>>>> On Jan 3, 2017, at 7:51 PM, Bob Haffner via Omaha <
>> >> omaha at python.org
>> >> >> >
>> >> >> >>>> wrote:
>> >> >> >>>>
>> >> >> >>>> Pretty interesting notebook I put together regarding the kaggle comp
>> >> >> >>>> https://github.com/bobhaffner/kaggle-houseprices/blob/master/additional_training_data.ipynb
>> >> >> >>>>
>> >> >> >>>> On Mon, Jan 2, 2017 at 12:10 AM, Wes Turner via Omaha <
>> >> >> >> omaha at python.org>
>> >> >> >>>> wrote:
>> >> >> >>>>
>> >> >> >>>>>> On Wednesday, December 28, 2016, Wes Turner <
>> >> wes.turner at gmail.com>
>> >> >> >>> wrote:
>> >> >> >>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>> On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha <
>> >> >> >>>>>> omaha at python.org> wrote:
>> >> >> >>>>>>
>> >> >> >>>>>>> Woohoo! We jumped 286 positions with a meager 0.00448 improvement
>> >> >> >>>>>>> in our score! Currently sitting at 798th place.
>> >> >> >>>>>>
>> >> >> >>>>>> Nice work! Features of your feature engineering I admire:
>> >> >> >>>>>>
>> >> >> >>>>>> - nominal, ordinal, continuous, discrete
>> >> >> >>>>>> categorical = nominal + discrete
>> >> >> >>>>>> numeric = continuous + discrete
>> >> >> >>>>>>
>> >> >> >>>>>> - outlier removal
>> >> >> >>>>>> - [ ] w/ constant thresholding? (is there a distribution parameter?)
>> >> >> >>>>>>
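(On the constant-thresholding question above: one distribution-based alternative is an IQR fence rather than a fixed cutoff. A hedged sketch with made-up values:)

```python
import numpy as np

x = np.array([50.0, 52.0, 49.0, 51.0, 400.0])  # toy feature with one outlier
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
# Keep points inside the Tukey fences; 1.5 is a conventional choice, not a rule
kept = x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]
```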
>> >> >> >>>>>> - building datestrings from SaleMonth and YrSold
>> >> >> >>>>>> - SaleMonth / "1" / YrSold
>> >> >> >>>>>>  - df.drop(['MoSold','YrSold','SaleMonth'])
>> >> >> >>>>>>    - [ ] why drop SaleMonth?
>> >> >> >>>>>> - [ ] pandas.to_datetime(df['SaleMonth'])
>> >> >> >>>>>>
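(The datestring bullet above can be done directly with pandas.to_datetime; MoSold/YrSold are the competition's column names, the frame here is a toy:)

```python
import pandas as pd

df = pd.DataFrame({'MoSold': [2, 7], 'YrSold': [2008, 2009]})
# Build a proper datetime from year and month, pinned to the 1st of the month
df['SaleDate'] = pd.to_datetime(
    df['YrSold'].astype(str) + '-' + df['MoSold'].astype(str) + '-01')
```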
>> >> >> >>>>>> - merging with FHA Home Price Index for the month and region
>> >> >> >>>>>> ("West North Central")
>> >> >> >>>>>> https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_PO_monthly_hist.xls
>> >> >> >>>>>> - [ ] pandas.to_datetime
>> >> >> >>>>>>   - this should have every month, but the new merge_asof feature
>> >> >> >>>>>> is worth mentioning
>> >> >> >>>>>>
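(merge_asof, mentioned above, matches each sale to the most recent index value on or before its date — handy if the index table had gaps. Toy stand-ins for both frames:)

```python
import pandas as pd

# Hypothetical sales frame and FHFA monthly index; both must be sorted on keys
sales = pd.DataFrame({
    'SaleDate': pd.to_datetime(['2008-02-15', '2009-07-01']),
    'SalePrice': [200000, 150000],
}).sort_values('SaleDate')
hpi = pd.DataFrame({
    'Date': pd.to_datetime(['2008-01-01', '2008-06-01', '2009-06-01']),
    'HPI': [210.0, 205.0, 198.0],
}).sort_values('Date')

# Each sale picks up the last HPI row at or before its SaleDate
merged = pd.merge_asof(sales, hpi, left_on='SaleDate', right_on='Date')
```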
>> >> >> >>>>>> - manual binarization
>> >> >> >>>>>> - [ ] how did you pick these? correlation after pd.get_dummies?
>> >> >> >>>>>> - [ ] why floats? 1.0 / 1 (does it make a difference?)
>> >> >> >>>>>>
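(One way the "how did you pick these?" question could be answered: dummy-encode, then correlate each dummy column with SalePrice and keep the strongest. Toy frame, and the 0.5 threshold is an arbitrary illustration:)

```python
import pandas as pd

df = pd.DataFrame({
    'Neighborhood': ['NAmes', 'CollgCr', 'NAmes', 'CollgCr'],
    'SalePrice': [140000, 210000, 150000, 205000],
})
dummies = pd.get_dummies(df['Neighborhood'], dummy_na=False).astype(float)
# Correlate each one-hot column with the target
corr = pd.concat([dummies, df['SalePrice']], axis=1).corr()['SalePrice']
corr_no_target = corr.drop('SalePrice')
strong = corr_no_target[corr_no_target.abs() > 0.5].index.tolist()
```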
>> >> >> >>>>>> - Ames, IA nbrhood_multiplier
>> >> >> >>>>>> - http://www.cityofames.org/home/showdocument?id=1024
>> >> >> >>>>>>
>> >> >> >>>>>> - feature merging
>> >> >> >>>>>> - BsmtFinSF = BsmtFinSF1 + BsmtFinSF2
>> >> >> >>>>>> - TotalBaths = BsmtFullBath + (BsmtHalfBath / 2.0) + FullBath +
>> >> >> >>>>>>   (HalfBath / 2.0)
>> >> >> >>>>>> - ( ) IDK how a feature-selection pipeline could do this automatically
>> >> >> >>>>>>
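(The bath-merging above is a one-liner in pandas; the column names are the competition's, the values here are a toy:)

```python
import pandas as pd

df = pd.DataFrame({'BsmtFullBath': [1, 0], 'BsmtHalfBath': [0, 1],
                   'FullBath': [2, 1], 'HalfBath': [1, 0]})
# Half baths count for half; basement and above-grade baths are pooled
df['TotalBaths'] = (df['BsmtFullBath'] + df['BsmtHalfBath'] / 2.0
                    + df['FullBath'] + df['HalfBath'] / 2.0)
```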
>> >> >> >>>>>> - null value imputation
>> >> >> >>>>>> - .isnull() = 0
>> >> >> >>>>>> - ( ) datacleaner incorrectly sets these to median or mode
>> >> >> >>>>>>
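(Setting nulls to 0 rather than the median/mode that datacleaner picks is one line; a sketch on a toy column:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'BsmtFinSF1': [120.0, np.nan, 300.0]})
# For these features a missing value means "none", so 0 beats median imputation
df['BsmtFinSF1'] = df['BsmtFinSF1'].fillna(0)
```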
>> >> >> >>>>>> - log for skewed continuous and SalePrice
>> >> >> >>>>>> - ( ) auto_ml: take_log_of_y does this for SalePrice
>> >> >> >>>>>>
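(The log transform for the skewed target, sketched with numpy; np.log1p pairs with np.expm1 as its exact inverse at submission time. Prices are made up:)

```python
import numpy as np

prices = np.array([100000.0, 150000.0, 900000.0])  # right-skewed toy target
log_prices = np.log1p(prices)     # train the model on this
recovered = np.expm1(log_prices)  # invert before writing the submission file
```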
>> >> >> >>>>>> - "Keeping only the columns we want"
>> >> >> >>>>>> - [ ] 'Id' shouldn't be relevant (pd.read_csv(filename, index_col='Id'))
>> >> >> >>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>> - Binarization
>> >> >> >>>>>> - pd.get_dummies(dummy_na=False)
>> >> >> >>>>>> - [ ] (as Luke pointed out, concatenation keeps the same columns)
>> >> >> >>>>>>       rows = eng_train.shape[0]
>> >> >> >>>>>>       eng_merged = pd.concat([eng_train, eng_test])
>> >> >> >>>>>>       onehot_merged = pd.get_dummies(eng_merged, columns=nominal,
>> >> >> >>>>>>                                      dummy_na=False)
>> >> >> >>>>>>       onehot_train = onehot_merged[:rows]
>> >> >> >>>>>>       onehot_test = onehot_merged[rows:]
>> >> >> >>>>>>
>> >> >> >>>>>> - class RandomSelectionHelper
>> >> >> >>>>>> - [ ] this could be generally helpful in sklearn[-pandas]
>> >> >> >>>>>>   - https://github.com/paulgb/sklearn-pandas#cross-validation
>> >> >> >>>>>>
>> >> >> >>>>>> - Models to Search
>> >> >> >>>>>> - {Ridge, Lasso, ElasticNet}
>> >> >> >>>>>>
>> >> >> >>>>>>    - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L222
>> >> >> >>>>>>      _get_estimator_names ( "regressor" )
>> >> >> >>>>>>      - {XGBRegressor, GradientBoostingRegressor, RANSACRegressor,
>> >> >> >>>>>> RandomForestRegressor, LinearRegression, AdaBoostRegressor,
>> >> >> >>>>>> ExtraTreesRegressor}
>> >> >> >>>>>>
>> >> >> >>>>>>    - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L491
>> >> >> >>>>>>      - (w/ ensembling)
>> >> >> >>>>>>      - ['RandomForestRegressor', 'LinearRegression',
>> >> >> >>>>>> 'ExtraTreesRegressor', 'Ridge', 'GradientBoostingRegressor',
>> >> >> >>>>>> 'AdaBoostRegressor', 'Lasso', 'ElasticNet', 'LassoLars',
>> >> >> >>>>>> 'OrthogonalMatchingPursuit', 'BayesianRidge', 'SGDRegressor'] +
>> >> >> >>>>>> ['XGBRegressor']
>> >> >> >>>>>>
>> >> >> >>>>>> - model stacking / ensembling
>> >> >> >>>>>>
>> >> >> >>>>>> - ( ) auto_ml: https://auto-ml.readthedocs.io/en/latest/ensembling.html
>> >> >> >>>>>> - ( ) auto-sklearn:
>> >> >> >>>>>>       https://automl.github.io/auto-sklearn/stable/api.html#autosklearn.regression.AutoSklearnRegressor
>> >> >> >>>>>>       ensemble_size=50, ensemble_nbest=50
>> >> >> >>>>>
>> >> >> >>>>> https://en.wikipedia.org/wiki/Ensemble_learning
>> >> >> >>>>>
>> >> >> >>>>> http://www.scholarpedia.org/article/Ensemble_learning#Ensemble_combination_rules
>> >> >> >>>>>
>> >> >> >>>>>
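(A minimal combination rule from the links above is just averaging predictions from several fitted regressors; the arrays here are hypothetical log-price predictions:)

```python
import numpy as np

# Hypothetical log-price predictions from three fitted models
preds = {
    'ridge': np.array([11.9, 12.2]),
    'lasso': np.array([12.0, 12.1]),
    'xgb':   np.array([11.8, 12.3]),
}
ensemble = np.mean(list(preds.values()), axis=0)  # simple unweighted average
```

(auto_ml and auto-sklearn go further — weighting and stacking members by validation score — but the averaging rule is the baseline to beat.)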
>> >> >> >>>>>>
>> >> >> >>>>>> - submission['SalePrice'] = submission.SalePrice.apply(lambda x:
>> >> >> >>>>>> np.exp(x))
>> >> >> >>>>>>
>> >> >> >>>>>> - [ ] What is this called / how does this work?
>> >> >> >>>>>>   - https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
>> >> >> >>>>>>
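(On the "what is this called?" checkbox: exponentiating is back-transforming — np.exp inverts the np.log applied to SalePrice earlier, turning log-space predictions back into dollars. A tiny sketch with a made-up price:)

```python
import numpy as np

price = 180000.0
log_price = np.log(price)  # the target the model was actually trained on
back = np.exp(log_price)   # np.exp undoes np.log, recovering dollars
```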
>> >> >> >>>>>> - df.to_csv(filename, columns=['SalePrice'], index_label='Id') also works
>> >> >> >>>>>> - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
>> >> >> >>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>>> My notebook is on GitHub for those interested:
>> >> >> >>>>>>>
>> >> >> >>>>>>> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4
>> >> >> >>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>> Thanks!
>> >> >> >>>>>
>> >> >> >>>>> (Trimmed for 40K limit)
>> >> >> >>>>> _______________________________________________
>> >> >> >>>>> Omaha Python Users Group mailing list
>> >> >> >>>>> Omaha at python.org
>> >> >> >>>>> https://mail.python.org/mailman/listinfo/omaha
>> >> >> >>>>> http://www.OmahaPython.org
>> >> >>
>> >> >
>> >> >
>> >
>> >
>>
>

