[omaha] Group Data Science Competition

Wes Turner wes.turner at gmail.com
Fri Feb 10 11:13:10 EST 2017


Anyone have a good way to re.findall() the links in this thread? (rough sketch below)

- [ ] linkgrep(thread) >> wiki
- [ ] http://www.datatau.com is like news.ycombinator.com for Data Science
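
A rough sketch (assumes the thread is saved locally; "thread.txt" is a
made-up filename and the URL regex is deliberately simplified):

    import re

    # Simplified URL pattern: runs until whitespace or an angle bracket.
    URL_RE = re.compile(r'https?://[^\s<>]+')

    def linkgrep(path):
        """Return the unique links in a text file, in order of appearance."""
        with open(path, encoding='utf-8') as f:
            urls = URL_RE.findall(f.read())
        seen = []
        for url in urls:
            if url not in seen:   # de-duplicate, preserving order
                seen.append(url)
        return seen

    for url in linkgrep('thread.txt'):
        print(url)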

@bob
Cool.
What is the (MSE, $ deviance) with Keras?

Keras with Theano or TensorFlow?
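
Not your actual model, of course; just a minimal Keras 1.x-era Sequential
sketch showing where the per-epoch training/validation MSE lives (data,
layer sizes, and epoch count are made up):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # Toy stand-ins for the engineered features and log(SalePrice).
    X = np.random.rand(100, 20)
    y = np.random.rand(100)

    model = Sequential()
    model.add(Dense(64, input_dim=20, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')

    # history.history['loss'] / ['val_loss'] hold the MSE for each epoch
    # (the keyword is nb_epoch in Keras 1, epochs in Keras 2).
    history = model.fit(X, y, nb_epoch=10, validation_split=0.2, verbose=0)
    print(history.history['loss'][-1], history.history['val_loss'][-1])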

On Thursday, February 9, 2017, Bob Haffner via Omaha <omaha at python.org>
wrote:

> Added a Deep Learning section to my notebook
> https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices.ipynb
>
> Using Keras for the modeling with TensorFlow as the backend.
>
> I've generated a submission, but I don't know how it performed as Kaggle
> seems to be on the fritz tonight.
>
> On Sat, Jan 14, 2017 at 12:52 AM, Wes Turner <wes.turner at gmail.com> wrote:
>
> >
> >
> > On Friday, January 13, 2017, Bob Haffner via Omaha <omaha at python.org> wrote:
> >
> >> Look at that.  Two teams have submitted perfect scores :-)
> >>
> >> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard
> >
> >
> > https://www.kaggle.com/c/house-prices-advanced-regression-techniques/rules
> >
> >    - Due to the public nature of the data, this competition does not
> >    count towards Kaggle ranking points.
> >    - We ask that you respect the spirit of the competition and do not
> >    cheat. Hand-labeling is forbidden.
> >
> >
> > https://www.kaggle.com/wiki/ModelSubmissionBestPractices
> >
> > https://www.kaggle.com/wiki/WinningModelDocumentationTemplate (CNN, XGBoost)
> >
> > Hopefully I can find some time to fix the data loading function in my
> > data.py and test w/ TPOT (manual sparse arrays), auto_ml,
> >
> > - https://www.coursera.org/learn/ml-foundations/lecture/2HrHv/learning-a-simple-regression-model-to-predict-house-prices-from-house-size (UW)
> >
> > - "Python Data Science Handbook" "This repository contains entire Python
> > Data Science Handbook <http://shop.oreilly.com/product/0636920034919.do
> >,
> > in the form of (free!) Jupyter notebooks."
> > https://github.com/jakevdp/PythonDataScienceHandbook/
> > blob/master/README.md#5-machine-learning (~UW)
> >
> > I'd also like to learn how to NN w/ tensors and Keras (Theano, TensorFlow)
> > https://github.com/fchollet/keras
> >
> > - https://keras.io/getting-started/faq/#how-can-i-record-the-training-validation-loss-accuracy-at-each-epoch
> >
> > - http://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/
> >
> >
> >> On Thu, Jan 5, 2017 at 11:20 AM, Bob Haffner <bob.haffner at gmail.com> wrote:
> >>
> >> > Hi Travis,
> >> >
> >> >
> >> >
> >> > A few of us are doing the House Prices: Advanced Regression Techniques
> >> > competition
> >> >
> >> > https://www.kaggle.com/c/house-prices-advanced-regression-techniques
> >> >
> >> >
> >> >
> >> > Our team is called Omaha Pythonistas.  You are more than welcome to join
> >> > us!  Just let me know which email you use to sign up with on Kaggle and
> >> > I’ll send out an invite.
> >> >
> >> >
> >> >
> >> > We met in December and we hope to meet again soon.  Most likely following
> >> > our monthly meeting on 1/18.
> >> >
> >> >
> >> >
> >> > Some of our materials:
> >> >
> >> > https://github.com/omahapython/kaggle-houseprices
> >> >
> >> >
> >> >
> >> > https://github.com/jeremy-doyle/home_price_kaggle
> >> >
> >> >
> >> >
> >> > https://github.com/bobhaffner/kaggle-houseprices
> >> >
> >> > On Wed, Jan 4, 2017 at 8:50 AM, Travis Smith via Omaha <omaha at python.org> wrote:
> >> >
> >> >> Hey, new guy here. What's the challenge, exactly?  I'm not a Kaggler yet,
> >> >> but I have taken some data science courses.
> >> >>
> >> >> -Travis
> >> >>
> >> >> > On Jan 4, 2017, at 7:57, Luke Schollmeyer via Omaha <omaha at python.org> wrote:
> >> >> >
> >> >> > I think there are two probable things:
> >> >> > 1. We're likely using some under-powered ML methods. Most of the Kaggle
> >> >> > interviews of the top guys/teams I read are using some much more advanced
> >> >> > methods to get their solutions into the top spots. I think what we're doing
> >> >> > is fine for what we want to accomplish.
> >> >> > 2. Feature engineering. Again, many of the interviews show that a ton of
> >> >> > work goes into cleaning and conforming the data.
> >> >> >
> >> >> > I haven't backtracked any of the interviews to their submissions, so I
> >> >> > don't know how often they tend to submit, like tweak a small aspect and
> >> >> > keep honing that until it pays off.
> >> >> >
> >> >> > On Wed, Jan 4, 2017 at 7:43 AM, Bob Haffner via Omaha <omaha at python.org> wrote:
> >> >> >
> >> >> >> Yeah, no kidding.  That pdf wasn't hard to find and that #1 score is
> >> >> >> pretty damn good.
> >> >> >>
> >> >> >> On Tue, Jan 3, 2017 at 10:41 PM, Jeremy Doyle via Omaha <omaha at python.org> wrote:
> >> >> >>
> >> >> >>> Looks like we have our key to a score of 0.0. Lol
> >> >> >>>
> >> >> >>> Seriously though, does anyone wonder if the person sitting at #1 had this
> >> >> >>> full data set as well and trained a model using the entire set? I mean that
> >> >> >>> 0.038 score is so much better than anyone else's that it seems a little
> >> >> >>> unrealistic...or maybe it just seems that way because I haven't been able
> >> >> >>> to break through 0.12   : )
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> Sent from my iPhone
> >> >> >>>> On Jan 3, 2017, at 7:51 PM, Bob Haffner via Omaha <omaha at python.org> wrote:
> >> >> >>>>
> >> >> >>>> Pretty interesting notebook I put together regarding the kaggle comp:
> >> >> >>>> https://github.com/bobhaffner/kaggle-houseprices/blob/master/additional_training_data.ipynb
> >> >> >>>>
> >> >> >>>> On Mon, Jan 2, 2017 at 12:10 AM, Wes Turner via Omaha <omaha at python.org> wrote:
> >> >> >>>>
> >> >> >>>>> On Wednesday, December 28, 2016, Wes Turner <wes.turner at gmail.com> wrote:
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>> On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha <omaha at python.org> wrote:
> >> >> >>>>>>
> >> >> >>>>>>> Woohoo! We jumped 286 positions with a meager 0.00448 improvement in
> >> >> >>>>>>> our score! Currently sitting at 798th place.
> >> >> >>>>>>
> >> >> >>>>>> Nice work! Features of your feature engineering I admire:
> >> >> >>>>>>
> >> >> >>>>>> - nominal, ordinal, continuous, discrete
> >> >> >>>>>> categorical = nominal + discrete
> >> >> >>>>>> numeric = continuous + discrete
> >> >> >>>>>>
> >> >> >>>>>> - outlier removal
> >> >> >>>>>> - [ ] w/ constant thresholding? (is there a distribution parameter?); sketch below
> >> >> >>>>>>
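> >> >> >>>>>> A sketch of one thresholding option (the 3-sigma cutoff is an
> >> >> >>>>>> assumed parameter, not necessarily what the notebook used):
> >> >> >>>>>>
> >> >> >>>>>>     import numpy as np
> >> >> >>>>>>
> >> >> >>>>>>     def drop_outliers(df, col, nsigma=3.0):
> >> >> >>>>>>         """Drop rows where col is > nsigma std devs from the mean."""
> >> >> >>>>>>         z = np.abs((df[col] - df[col].mean()) / df[col].std())
> >> >> >>>>>>         return df[z <= nsigma]
> >> >> >>>>>>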
> >> >> >>>>>> - building datestrings from SaleMonth and YrSold
> >> >> >>>>>> - SaleMonth / "1" / YrSold
> >> >> >>>>>>  - df.drop(['MoSold','YrSold','SaleMonth'])
> >> >> >>>>>>    - [ ] why drop SaleMonth?
> >> >> >>>>>> - [ ] pandas.to_datetime(df['SaleMonth']) (sketch below)
> >> >> >>>>>>
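> >> >> >>>>>> A sketch of that datestring build ('SaleDate' is an invented
> >> >> >>>>>> name; MoSold/YrSold are the competition's columns):
> >> >> >>>>>>
> >> >> >>>>>>     import pandas as pd
> >> >> >>>>>>
> >> >> >>>>>>     df = pd.DataFrame({'MoSold': [2, 7], 'YrSold': [2008, 2009]})
> >> >> >>>>>>     # "MoSold/1/YrSold" -> a real datetime (day pinned to the 1st)
> >> >> >>>>>>     strings = df['MoSold'].astype(str) + '/1/' + df['YrSold'].astype(str)
> >> >> >>>>>>     df['SaleDate'] = pd.to_datetime(strings, format='%m/%d/%Y')
> >> >> >>>>>>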
> >> >> >>>>>> - merging with FHA Home Price Index for the month and region
> >> >> >>>>>>   ("West North Central")
> >> >> >>>>>>   https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_PO_monthly_hist.xls
> >> >> >>>>>> - [ ] pandas.to_datetime
> >> >> >>>>>>   - this should have every month, but the new merge_asof feature
> >> >> >>>>>>     is worth mentioning (sketch below)
> >> >> >>>>>>
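> >> >> >>>>>> A merge_asof sketch (pandas >= 0.19; the frames, column names,
> >> >> >>>>>> and index values are invented for illustration):
> >> >> >>>>>>
> >> >> >>>>>>     import pandas as pd
> >> >> >>>>>>
> >> >> >>>>>>     sales = pd.DataFrame({'SaleDate': pd.to_datetime(
> >> >> >>>>>>         ['2008-02-15', '2009-07-20'])}).sort_values('SaleDate')
> >> >> >>>>>>     hpi = pd.DataFrame(
> >> >> >>>>>>         {'Month': pd.to_datetime(['2008-02-01', '2009-07-01']),
> >> >> >>>>>>          'HPI': [202.3, 195.7]}).sort_values('Month')
> >> >> >>>>>>
> >> >> >>>>>>     # For each sale, take the latest HPI row at or before that date.
> >> >> >>>>>>     merged = pd.merge_asof(sales, hpi,
> >> >> >>>>>>                            left_on='SaleDate', right_on='Month')
> >> >> >>>>>>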
> >> >> >>>>>> - manual binarization
> >> >> >>>>>> - [ ] how did you pick these? correlation after pd.get_dummies?
> >> >> >>>>>> - [ ] why floats? 1.0 / 1 (does it make a difference?)
> >> >> >>>>>>
> >> >> >>>>>> - Ames, IA nbrhood_multiplier
> >> >> >>>>>> - http://www.cityofames.org/home/showdocument?id=1024
> >> >> >>>>>>
> >> >> >>>>>> - feature merging
> >> >> >>>>>> - BsmtFinSF = BsmtFinSF1 + BsmtFinSF2
> >> >> >>>>>> - TotalBaths = BsmtFullBath + (BsmtHalfBath / 2.0) + FullBath +
> >> >> >>>>>>   (HalfBath / 2.0)
> >> >> >>>>>> - ( ) IDK how a feature-selection pipeline could do this automatically
> >> >> >>>>>>
> >> >> >>>>>> - null value imputation
> >> >> >>>>>> - .isnull() = 0 (zero-fill; sketch below)
> >> >> >>>>>> - ( ) datacleaner incorrectly sets these to median or mode
> >> >> >>>>>>
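> >> >> >>>>>> A sketch of that zero-fill (BsmtFinSF1 is just an example column):
> >> >> >>>>>>
> >> >> >>>>>>     import pandas as pd
> >> >> >>>>>>
> >> >> >>>>>>     df = pd.DataFrame({'BsmtFinSF1': [648.0, None]})
> >> >> >>>>>>     # In the Ames data, NaN in a basement column usually means
> >> >> >>>>>>     # "no basement", so 0 is more faithful than a median or mode.
> >> >> >>>>>>     df['BsmtFinSF1'] = df['BsmtFinSF1'].fillna(0)
> >> >> >>>>>>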
> >> >> >>>>>> - log for skewed continuous and SalePrice
> >> >> >>>>>> - ( ) auto_ml: take_log_of_y does this for SalePrice
> >> >> >>>>>>
> >> >> >>>>>> - "Keeping only the columns we want"
> >> >> >>>>>> - [ ] 'Id' shouldn't be relevant (pd.read_csv(filename,
> >> >> >>> index_col='Id')
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>> - Binarization
> >> >> >>>>>> - pd.get_dummies(dummy_na=False)
> >> >> >>>>>> - [ ] (as Luke pointed out, concatenating train and test first
> >> >> >>>>>>   keeps the same dummy columns in both)
> >> >> >>>>>>       rows = eng_train.shape[0]
> >> >> >>>>>>       eng_merged = pd.concat([eng_train, eng_test])
> >> >> >>>>>>       onehot_merged = pd.get_dummies(eng_merged, columns=nominal,
> >> >> >>>>>>                                      dummy_na=False)
> >> >> >>>>>>       # slice the *encoded* frame back apart, by row position
> >> >> >>>>>>       onehot_train = onehot_merged.iloc[:rows]
> >> >> >>>>>>       onehot_test = onehot_merged.iloc[rows:]
> >> >> >>>>>>
> >> >> >>>>>> - class RandomSelectionHelper
> >> >> >>>>>> - [ ] this could be generally helpful in sklearn[-pandas] (sketch below)
> >> >> >>>>>>   - https://github.com/paulgb/sklearn-pandas#cross-validation
> >> >> >>>>>>
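> >> >> >>>>>> A minimal sketch of the model-search idea (the estimators and
> >> >> >>>>>> parameter grids here are illustrative, not the notebook's):
> >> >> >>>>>>
> >> >> >>>>>>     import numpy as np
> >> >> >>>>>>     from sklearn.linear_model import Ridge, Lasso
> >> >> >>>>>>     from sklearn.model_selection import GridSearchCV
> >> >> >>>>>>
> >> >> >>>>>>     X, y = np.random.rand(50, 4), np.random.rand(50)  # toy data
> >> >> >>>>>>     for name, est in [('ridge', Ridge()), ('lasso', Lasso())]:
> >> >> >>>>>>         gs = GridSearchCV(est, {'alpha': [0.1, 1.0, 10.0]},
> >> >> >>>>>>                           scoring='neg_mean_squared_error', cv=5)
> >> >> >>>>>>         gs.fit(X, y)
> >> >> >>>>>>         print(name, gs.best_params_, gs.best_score_)
> >> >> >>>>>>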
> >> >> >>>>>> - Models to Search
> >> >> >>>>>> - {Ridge, Lasso, ElasticNet}
> >> >> >>>>>>
> >> >> >>>>>>    - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L222
> >> >> >>>>>>      _get_estimator_names("regressor")
> >> >> >>>>>>      - {XGBRegressor, GradientBoostingRegressor, RANSACRegressor,
> >> >> >>>>>>        RandomForestRegressor, LinearRegression, AdaBoostRegressor,
> >> >> >>>>>>        ExtraTreesRegressor}
> >> >> >>>>>>
> >> >> >>>>>>    - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L491
> >> >> >>>>>>      - (w/ ensembling)
> >> >> >>>>>>      - ['RandomForestRegressor', 'LinearRegression',
> >> >> >>>>>>        'ExtraTreesRegressor', 'Ridge', 'GradientBoostingRegressor',
> >> >> >>>>>>        'AdaBoostRegressor', 'Lasso', 'ElasticNet', 'LassoLars',
> >> >> >>>>>>        'OrthogonalMatchingPursuit', 'BayesianRidge', 'SGDRegressor']
> >> >> >>>>>>        + ['XGBRegressor']
> >> >> >>>>>>
> >> >> >>>>>> - model stacking / ensembling
> >> >> >>>>>>
> >> >> >>>>>> - ( ) auto_ml: https://auto-ml.readthedocs.io/en/latest/ensembling.html
> >> >> >>>>>> - ( ) auto-sklearn:
> >> >> >>>>>>       https://automl.github.io/auto-sklearn/stable/api.html#autosklearn.regression.AutoSklearnRegressor
> >> >> >>>>>>       ensemble_size=50, ensemble_nbest=50
> >> >> >>>>>
> >> >> >>>>> https://en.wikipedia.org/wiki/Ensemble_learning
> >> >> >>>>>
> >> >> >>>>> http://www.scholarpedia.org/article/Ensemble_learning#Ensemble_combination_rules
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>>>
> >> >> >>>>>> - submission['SalePrice'] = submission.SalePrice.apply(lambda x: np.exp(x))
> >> >> >>>>>>
> >> >> >>>>>> - [ ] What is this called / how does this work? (see the sketch below)
> >> >> >>>>>>   - https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
> >> >> >>>>>>
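> >> >> >>>>>> It's the inverse of the earlier log transform: the model predicts
> >> >> >>>>>> log(SalePrice), and np.exp maps those predictions back to dollars:
> >> >> >>>>>>
> >> >> >>>>>>     import numpy as np
> >> >> >>>>>>
> >> >> >>>>>>     y = np.array([200000.0, 350000.0])
> >> >> >>>>>>     y_log = np.log(y)        # model is fit/scored in log space
> >> >> >>>>>>     y_back = np.exp(y_log)   # exp undoes log; equals y again
> >> >> >>>>>>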
> >> >> >>>>>> - df.to_csv(filename, columns=['SalePrice'], index_label='Id') also works
> >> >> >>>>>> - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>>> My notebook is on GitHub for those interested:
> >> >> >>>>>>>
> >> >> >>>>>>> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>> Thanks!
> >> >> >>>>>
> >> >> >>>>> (Trimmed for 40K limit)
> _______________________________________________
> Omaha Python Users Group mailing list
> Omaha at python.org
> https://mail.python.org/mailman/listinfo/omaha
> http://www.OmahaPython.org

