[omaha] Group Data Science Competition

Sun Dec 18 12:55:07 EST 2016

Wes, I can try to run your process with do_get_dummies=True.  Anything else need to change?

> On Dec 18, 2016, at 10:59 AM, Wes Turner <wes.turner at gmail.com> wrote:
> 
> Thanks, Bob!
> 
>> On Sun, Dec 18, 2016 at 9:26 AM, Bob Haffner <bob.haffner at gmail.com> wrote:
>> Nice job, Wes!!
>> 
>>> On Sun, Dec 18, 2016 at 4:11 AM, Wes Turner <wes.turner at gmail.com> wrote:
>>> In addition to posting to the mailing list, I created a comment on the "Kaggle Submissions" issue [1]:
>>> 
>>>> - Score: 0.13667 (#1370)
>>>>   - https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3925119
>>>>   - https://mail.python.org/pipermail/omaha/2016-December/002206.html
>>>>   - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py
>>> 
>>> [1] https://github.com/omahapython/kaggle-houseprices/issues/2
>>> 
>>>> On Sun, Dec 18, 2016 at 3:45 AM, Wes Turner <wes.turner at gmail.com> wrote:
>>>> Sounds great. 1/18.
>>>> 
>>>> I just submitted my first submission.csv to Kaggle! [1]
>>>> 
>>>> $ python ./tpot_house_prices__001__modified.py
>>>> class_sum: 264144946
>>>> abs error: 5582809.288
>>>> % error:   2.11354007432 %
>>>> error**2:  252508654837.0
>>>> #  python ./tpot_house_prices__001__modified.py
>>>> 
>>>> ... Which moves us up to #1370!
>>>> 
>>>> Your Best Entry ↑
>>>> You improved on your best score by 0.02469.
>>>> You just moved up 608 positions on the leaderboard.
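
For anyone following along, the figures printed by the script above can be reproduced with a few lines of NumPy. This is only a sketch, not the code in tpot_house_prices__001__modified.py: it assumes class_sum is the sum of the true sale prices, "abs error" is the summed absolute error, and "error**2" is the summed squared error (the reported % error is consistent with 100 * abs error / class_sum). The function names error_summary and rmsle are hypothetical. The second helper follows the competition metric, the RMSE between the log of the predicted and the log of the observed price.

```python
import numpy as np


def error_summary(y_true, y_pred):
    """Summary figures in the style of the script output (assumed definitions)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    class_sum = y_true.sum()                   # sum of the true values
    abs_error = np.abs(y_true - y_pred).sum()  # summed absolute error
    return {
        "class_sum": class_sum,
        "abs_error": abs_error,
        "pct_error": 100.0 * abs_error / class_sum,
        "sq_error": ((y_true - y_pred) ** 2).sum(),  # summed squared error
    }


def rmsle(y_true, y_pred):
    """RMSE between log(predicted) and log(observed); prices must be positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


if __name__ == "__main__":
    # Toy values for illustration only, not competition data.
    print(error_summary([100.0, 200.0], [110.0, 190.0]))
    print(rmsle([100.0, 200.0], [110.0, 190.0]))
```

Tracking rmsle on a held-out split locally would give a number comparable to the leaderboard score, whereas the summed errors depend on how many rows are scored.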
>>>> 
>>>> 
>>>> I have a few more things to try:
>>>> 
>>>> - Manually drop the 'Id' column
>>>> - do_get_dummies=True (data.py) + EC2 m4.4xlarge instance
>>>>   - I got an OOM error w/ an 8GB notebook (at 25/120 w/ verbosity=2)
>>>>   - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L94
>>>> - Tune the TPOT hyperparameters with sklearn GridSearchCV and/or sklearn-deap
>>>>   - http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>>>>   - https://github.com/rsteca/sklearn-deap
>>>> - REF,BLD,DOC,TST:
>>>>   - factor constants out in favor of settings.json and data.py
>>>>     - https://github.com/omahapython/kaggle-houseprices/blob/master/src/data.py
>>>>   - implement train.py and predict.py, too
>>>>   - create a Dockerfile FROM kaggle/docker-python:latest
>>>>     - https://github.com/omahapython/datascience/issues/3 "Kaggle Best Practices"
>>>>   - docstrings, tests
>>>>     - https://github.com/omahapython/datascience/wiki/resources
>>>> [1] https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py
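
The first two to-do items above (dropping 'Id' and one-hot encoding) can be sketched with plain pandas. What do_get_dummies actually does inside data.py is not shown in this thread, so prepare_features below is a hypothetical helper and the toy frame is made-up data with a few column names borrowed from the competition files.

```python
import pandas as pd

# Toy stand-in for train.csv; the real file has many more columns.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
    "MSZoning": ["RL", "RM", "RL"],
    "SalePrice": [208500, 181500, 223500],
})


def prepare_features(frame, do_get_dummies=True):
    """Drop the row identifier and target, then optionally one-hot encode."""
    # 'Id' is just a row label, so it carries no signal for the model.
    X = frame.drop(columns=["Id", "SalePrice"], errors="ignore")
    if do_get_dummies:
        # Each string column becomes one indicator column per category.
        X = pd.get_dummies(X)
    return X


X = prepare_features(df)
print(X.shape)
print(sorted(X.columns))
```

One-hot encoding widens the frame by one column per category level, which can raise memory use noticeably on the full dataset.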
>>>> 
>>>>> On Sat, Dec 17, 2016 at 4:39 PM, Bob Haffner via Omaha <omaha at python.org> wrote:
>>>>> Hey all, regarding the January kaggle meetup that we talked about.  Maybe
>>>>> we can meet following our regular monthly meeting (1/18).
>>>>> 
>>>>> Would that be easier/better for everyone?
>>>>> 
>>>>> On Sat, Dec 17, 2016 at 4:34 PM, Bob Haffner <bob.haffner at gmail.com> wrote:
>>>>> 
>>>>> > Just submitted another Linear Regression attempt (0.16136).  Added some
>>>>> > features, both numeric and categorical, and created 3 numerics
>>>>> >
>>>>> > -TotalFullBaths
>>>>> > -TotalHalfBaths
>>>>> > -Pool
>>>>> >
>>>>> > Notebook attached
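
The attached notebook is not reproduced in the archive, so the following is only a guess at how the three numerics might be built. It assumes they are derived from the dataset's FullBath, BsmtFullBath, HalfBath, BsmtHalfBath, and PoolArea columns; the helper name add_bath_and_pool_features and the exact formulas are assumptions, not Bob's code.

```python
import pandas as pd


def add_bath_and_pool_features(frame):
    """Return a copy of `frame` with three derived numeric columns (assumed formulas)."""
    out = frame.copy()
    # Above-grade plus basement baths; missing basement counts are treated as 0.
    out["TotalFullBaths"] = out["FullBath"] + out["BsmtFullBath"].fillna(0)
    out["TotalHalfBaths"] = out["HalfBath"] + out["BsmtHalfBath"].fillna(0)
    # 1 if the house has any pool area, else 0.
    out["Pool"] = (out["PoolArea"] > 0).astype(int)
    return out


# Toy rows for illustration only.
toy = pd.DataFrame({
    "FullBath": [2, 1],
    "BsmtFullBath": [1, None],
    "HalfBath": [1, 0],
    "BsmtHalfBath": [0, 1],
    "PoolArea": [0, 512],
})
features = add_bath_and_pool_features(toy)
print(features[["TotalFullBaths", "TotalHalfBaths", "Pool"]])
```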
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Sat, Dec 17, 2016 at 4:21 PM, Bob Haffner <bob.haffner at gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> >> Just submitted another Linear Regression attempt (0.16136).  Added some
>>>>> >> features, both numeric and categorical, and created 3 numerics
>>>>> >>
>>>>> >> -TotalFullBaths
>>>>> >> -TotalHalfBaths
>>>>> >> -Pool
>>>>> >>
>>>>> >> Notebook attached
>>>>> >>
>>>>> >> On Sat, Dec 17, 2016 at 3:28 PM, Wes Turner <wes.turner at gmail.com> wrote:
>>>>> >>
>>>>> >>>
>>>>> >>>
>>>>> >>> On Sat, Dec 17, 2016 at 3:25 PM, Wes Turner <wes.turner at gmail.com>
>>>>> >>> wrote:
>>>>> >>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Sat, Dec 17, 2016 at 2:39 PM, Bob Haffner via Omaha <
>>>>> >>>> omaha at python.org> wrote:
>>>>> >>>>
>>>>> >>>>> >Does Kaggle take the high mark but still give a score for each submission?
>>>>> >>>>> Yes.
>>>>> >>>>> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/submissions
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> >Thinking of ways to keep track of which code produced which score; I'll
>>>>> >>>>> >post about the GitHub setup in a bit.
>>>>> >>>>> We could push our notebooks to the github repo?  Maybe include a brief
>>>>> >>>>> description at the top in a markdown cell
>>>>> >>>>>
>>>>> >>>>
>>>>> >>>> In my research [1], I found that the preferred folder structure for
>>>>> >>>> kaggle is input/ (data), src/ (.py, .ipynb notebooks), and working/
>>>>> >>>> (outputs); and that they recommend creating a settings.json with path
>>>>> >>>> configuration (e.g. pointing to input/, src/, data/)
>>>>> >>>>
>>>>> >>>> So, we could put notebooks, folders, and repos in src/ [2].
>>>>> >>>>
>>>>> >>>> runipy is a bit more scriptable than requiring notebook gui
>>>>> >>>> interactions [3].
>>>>> >>>>
>>>>> >>>> We could either hardcode '../input/test.csv' in our .py and .ipynb
>>>>> >>>> sources, or we could write a function in src/data.py to read
>>>>> >>>> '../settings.json' into a dict with the recommended variable names [1]:
>>>>> >>>>
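>>>>> >>>>     import pandas as pd
>>>>> >>>>     from data import read_settings_json
>>>>> >>>>
>>>>> >>>>     settings = read_settings_json()
>>>>> >>>>     train = pd.read_csv(settings['TRAIN_DATA_PATH'])
>>>>> >>>>     # ....
>>>>> >>>>     # write the predictions DataFrame (here called `submission`) to disk
>>>>> >>>>     submission.to_csv(settings['SUBMISSION_PATH'], index=False)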
>>>>> >>>>
>>>>> >>>> [1] https://github.com/omahapython/datascience/issues/3#issuecomment-267236556
>>>>> >>>> [2] https://github.com/omahapython/kaggle-houseprices/tree/master/src
>>>>> >>>> [3] https://pypi.python.org/pypi/runipy
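
A minimal sketch of what the proposed read_settings_json could look like, using only the standard library. The function name and the TRAIN_DATA_PATH / SUBMISSION_PATH keys come from the message above; the default path, the example file contents, and the '../input/train.csv' and '../working/submission.csv' values are assumptions for illustration, not the contents of the repo's data.py.

```python
import json
import os
import tempfile


def read_settings_json(path="../settings.json"):
    """Load the path configuration from a JSON file into a dict."""
    with open(path) as f:
        return json.load(f)


# Demo: write a throwaway settings.json so the example runs anywhere.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "settings.json")
with open(demo_path, "w") as f:
    json.dump(
        {
            "TRAIN_DATA_PATH": "../input/train.csv",            # assumed value
            "SUBMISSION_PATH": "../working/submission.csv",     # assumed value
        },
        f,
    )

settings = read_settings_json(demo_path)
print(settings["TRAIN_DATA_PATH"])
print(settings["SUBMISSION_PATH"])
```

Keeping the paths in one file means notebooks and scripts in src/ can move between machines without edits to hardcoded strings.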
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>>
>>>>> >>>>> I initially thought github was a good way to go, but I don't know if
>>>>> >>>>> everyone has a github acct or is interested in starting one.  Maybe
>>>>> >>>>> email is the way to go?
>>>>> >>>>>
>>>>> >>>>
>>>>> >>>> I'm all for GitHub:
>>>>> >>>>
>>>>> >>>> - git source control and revision numbers
>>>>> >>>> - we're not able to easily share code in the mailing list
>>>>> >>>> - we can learn from each others' solutions
>>>>> >>>>
>>>>> >>>
>>>>> >>> An example of mailing list limitations:
>>>>> >>>
>>>>> >>>
>>>>> >>> Your mail to 'Omaha' with the subject
>>>>> >>>
>>>>> >>>     Re: [omaha] Group Data Science Competition
>>>>> >>>
>>>>> >>> Is being held until the list moderator can review it for approval.
>>>>> >>>
>>>>> >>> The reason it is being held:
>>>>> >>>
>>>>> >>>     Message body is too big: 47004 bytes with a limit of 40 KB
>>>>> >>>
>>>>> >>>  (I trimmed out the reply chain; so this may make it through first)
>>>>> >>>
>>>>> >>
>>>>> >>
>>>>> >
>>>>> _______________________________________________
>>>>> Omaha Python Users Group mailing list
>>>>> Omaha at python.org
>>>>> https://mail.python.org/mailman/listinfo/omaha
>>>>> http://www.OmahaPython.org
>>>> 
>>> 
>> 
>