[omaha] Group Data Science Competition

Wes Turner wes.turner at gmail.com
Sun Dec 18 05:11:08 EST 2016


In addition to posting to the mailing list, I created a comment on the
"Kaggle Submissions" issue [1]:

> - Score: 0.13667 (#1370)
>   - https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3925119
>   - https://mail.python.org/pipermail/omaha/2016-December/002206.html
>   - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py


[1] https://github.com/omahapython/kaggle-houseprices/issues/2

On Sun, Dec 18, 2016 at 3:45 AM, Wes Turner <wes.turner at gmail.com> wrote:

> Sounds great. 1/18.
>
> I just submitted my first submission.csv to Kaggle! [1]
>
> $ python ./tpot_house_prices__001__modified.py
> class_sum: 264144946
> abs error: 5582809.288
> % error:   2.11354007432 %
> error**2:  252508654837.0
> #  python ./tpot_house_prices__001__modified.py
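>
> For reference, a rough sketch of how the printed "% error" relates to the
> other two numbers (an assumption about what the script computes, not taken
> from its source):
>
>     class_sum = 264144946      # sum of the target values over the scored rows (assumed)
>     abs_error = 5582809.288    # summed absolute error |y_true - y_pred| (assumed)
>     pct_error = abs_error / class_sum * 100
>     print("% error:  ", pct_error)  # ~2.11354 %, matching the output above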
>
>
> ... Which moves us up to #1370!
>
> Your Best Entry ↑
> You improved on your best score by 0.02469.
> You just moved up 608 positions on the leaderboard.
>
>
> I have a few more things to try:
>
>
>    - Manually drop the 'Id' column
>    - do_get_dummies=True (data.py) + EC2 m4.4xlarge instance (a sketch of
>      these two steps follows this list)
>       - I got an OOM error w/ an 8GB notebook (at 25/120 w/ verbosity=2)
>       - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L94
>    - sklearn GridSearchCV and/or sklearn-deap for the TPOT hyperparameters
>      (also sketched below)
>       - http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>       - https://github.com/rsteca/sklearn-deap
>    - REF,BLD,DOC,TST:
>       - factor constants out in favor of settings.json and data.py
>          - https://github.com/omahapython/kaggle-houseprices/blob/master/src/data.py
>       - implement train.py and predict.py, too
>       - create a Dockerfile FROM kaggle/docker-python:latest
>          - https://github.com/omahapython/datascience/issues/3 "Kaggle Best Practices"
>       - docstrings, tests
>    - https://github.com/omahapython/datascience/wiki/resources
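>
> A minimal sketch of the Id-drop and get_dummies steps (hypothetical code,
> not the actual data.py; the input path is an assumption):
>
>     import pandas as pd
>
>     train = pd.read_csv('../input/train.csv')
>     # drop the identifier column so it can't leak into the model
>     train = train.drop('Id', axis=1)
>     # one-hot encode the categorical columns (what do_get_dummies=True would do)
>     train = pd.get_dummies(train)
>
> And a sketch of the hyperparameter-search idea, assuming TPOTRegressor
> follows the scikit-learn get_params/set_params conventions so GridSearchCV
> can clone and refit it:
>
>     from sklearn.model_selection import GridSearchCV
>     from tpot import TPOTRegressor
>
>     param_grid = {'generations': [5, 10], 'population_size': [20, 50]}
>     search = GridSearchCV(TPOTRegressor(verbosity=0), param_grid, cv=3)
>     # search.fit(X, y)  # X, y: the prepared feature matrix and SalePrice target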
>
> [1] https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py
>
> On Sat, Dec 17, 2016 at 4:39 PM, Bob Haffner via Omaha <omaha at python.org>
> wrote:
>
>> Hey all, regarding our January Kaggle meetup that we talked about: maybe
>> we can meet following our regular monthly (1/18).
>>
>> Would that be easier/better for everyone?
>>
>> On Sat, Dec 17, 2016 at 4:34 PM, Bob Haffner <bob.haffner at gmail.com>
>> wrote:
>>
>> > Just submitted another Linear Regression attempt (0.16136).  Added some
>> > features, both numeric and categorical, and created 3 numerics
>> >
>> > -TotalFullBaths
>> > -TotalHalfBaths
>> > -Pool
>> >
>> > Notebook attached
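>> >
>> > A guess at what those three engineered numerics might look like in
>> > pandas (the source column names are taken from the competition's data
>> > description, not from the attached notebook):
>> >
>> >     import pandas as pd
>> >
>> >     df = pd.read_csv('../input/train.csv')  # path assumed
>> >     df['TotalFullBaths'] = df['FullBath'] + df['BsmtFullBath']
>> >     df['TotalHalfBaths'] = df['HalfBath'] + df['BsmtHalfBath']
>> >     df['Pool'] = (df['PoolArea'] > 0).astype(int)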
>> >
>> >
>> >
>> > On Sat, Dec 17, 2016 at 4:21 PM, Bob Haffner <bob.haffner at gmail.com>
>> > wrote:
>> >
>> >> Just submitted another Linear Regression attempt (0.16136).  Added some
>> >> features, both numeric and categorical, and created 3 numerics
>> >>
>> >> -TotalFullBaths
>> >> -TotalHalfBaths
>> >> -Pool
>> >>
>> >> Notebook attached
>> >>
>> >> On Sat, Dec 17, 2016 at 3:28 PM, Wes Turner <wes.turner at gmail.com>
>> wrote:
>> >>
>> >>>
>> >>>
>> >>> On Sat, Dec 17, 2016 at 3:25 PM, Wes Turner <wes.turner at gmail.com>
>> >>> wrote:
>> >>>
>> >>>>
>> >>>>
>> >>>> On Sat, Dec 17, 2016 at 2:39 PM, Bob Haffner via Omaha <
>> >>>> omaha at python.org> wrote:
>> >>>>
>> >>>>> >Does Kaggle take the high mark but still give a score for each
>> >>>>> >submission?
>> >>>>> Yes.
>> >>>>> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/submissions
>> >>>>>
>> >>>>>
>> >>>>> >Thinking of ways to keep track of which code produced which score; I'll
>> >>>>> >post about the GitHub setup in a bit.
>> >>>>> We could push our notebooks to the GitHub repo?  Maybe include a brief
>> >>>>> description at the top in a markdown cell.
>> >>>>>
>> >>>>
>> >>>> In my research [1], I found that the preferred folder structure for
>> >>>> Kaggle is input/ (data), src/ (.py, .ipynb notebooks), and working/
>> >>>> (outputs), and that they recommend creating a settings.json with path
>> >>>> configuration (e.g. pointing to input/, src/, data/).
>> >>>>
>> >>>> So, we could put notebooks, folders, and repos in src/ [2].
>> >>>>
>> >>>> runipy is a bit more scriptable than requiring notebook GUI
>> >>>> interactions [3].
>> >>>>
>> >>>> We could either hardcode '../input/test.csv' in our .py and .ipynb
>> >>>> sources, or we could write a function in src/data.py to read
>> >>>> '../settings.json' into a dict with the recommended variable names [1]:
>> >>>>
>> >>>>     import pandas as pd
>> >>>>     from data import read_settings_json
>> >>>>
>> >>>>     settings = read_settings_json()
>> >>>>     train = pd.read_csv(settings['TRAIN_DATA_PATH'])
>> >>>>     # ....
>> >>>>     # submission: a DataFrame of predictions built in the elided step
>> >>>>     submission.to_csv(settings['SUBMISSION_PATH'], index=False)
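>> >>>>
>> >>>> A minimal sketch of what read_settings_json might look like in
>> >>>> src/data.py (an assumption; only the key names used above come from
>> >>>> the recommended settings.json layout):
>> >>>>
>> >>>>     # src/data.py (sketch)
>> >>>>     import json
>> >>>>
>> >>>>     def read_settings_json(path='../settings.json'):
>> >>>>         """Load the Kaggle path settings (e.g. TRAIN_DATA_PATH,
>> >>>>         SUBMISSION_PATH) into a dict."""
>> >>>>         with open(path) as f:
>> >>>>             return json.load(f)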
>> >>>>
>> >>>> [1] https://github.com/omahapython/datascience/issues/3#issuecomment-267236556
>> >>>> [2] https://github.com/omahapython/kaggle-houseprices/tree/master/src
>> >>>> [3] https://pypi.python.org/pypi/runipy
>> >>>>
>> >>>>
>> >>>>>
>> >>>>> I initially thought GitHub was a good way to go, but I don't know if
>> >>>>> everyone has a GitHub acct or is interested in starting one.  Maybe
>> >>>>> email is the way to go?
>> >>>>>
>> >>>>
>> >>>> I'm all for GitHub:
>> >>>>
>> >>>> - git source control and revision numbers
>> >>>> - we're not able to easily share code in the mailing list
>> >>>> - we can learn from each others' solutions
>> >>>>
>> >>>
>> >>> An example of mailing list limitations:
>> >>>
>> >>>
>> >>> Your mail to 'Omaha' with the subject
>> >>>
>> >>>     Re: [omaha] Group Data Science Competition
>> >>>
>> >>> Is being held until the list moderator can review it for approval.
>> >>>
>> >>> The reason it is being held:
>> >>>
>> >>>     Message body is too big: 47004 bytes with a limit of 40 KB
>> >>>
>> >>>  (I trimmed out the reply chain; so this may make it through first)
>> >>>
>> >>
>> >>
>> >
>> _______________________________________________
>> Omaha Python Users Group mailing list
>> Omaha at python.org
>> https://mail.python.org/mailman/listinfo/omaha
>> http://www.OmahaPython.org
>>
>
>

