[omaha] Group Data Science Competition

Bob Haffner bob.haffner at gmail.com
Sun Dec 18 19:25:41 EST 2016


Looks like it's not as simple as setting a value to True so I'll let you
sort it out.



On Sun, Dec 18, 2016 at 1:00 PM, Wes Turner <wes.turner at gmail.com> wrote:

>
> On Sun, Dec 18, 2016 at 11:55 AM, Bob Haffner <bob.haffner at gmail.com>
> wrote:
>
>> Wes, I can try to run your process with do_get_dummies=True.  Anything
>> else need to change?
>>
>
> Yup,
>
> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L94 :
>
>     if do_get_dummies:
>         def get_categorical_columns(column_categories):
>             for colkey in column_categories:
>                 values = column_categories[colkey]
>                 if len(values):
>                     yield colkey
>         categorical_columns = list(get_categorical_columns(column_categories))
>         get_dummies_dict = {key: key for key in categorical_columns}
>         df = pd.get_dummies(df, prefix=get_dummies_dict, columns=get_dummies_dict)
>
> Needs to also be applied to train_csv and test_csv in the generated and
> modified pipeline:
> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py#L40
>
> So, I can either copy/paste or factor it out:
>
> - copy/paste: just wrong
> - factor it out (sketch below):
>   - this creates a (new) dependency on house_prices from within the
>     generated pipeline, which currently depends on [stable versions of]
>     datacleaner, pandas, and scikit-learn
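>
> If it does get factored out, a minimal sketch of a shared helper (the name
> and signature here are hypothetical, not what's in house_prices) could be:
>
>     import pandas as pd
>
>     def encode_categoricals(train_df, test_df, categorical_columns):
>         """One-hot encode both frames, then reindex the test frame so its
>         dummy columns line up with train's. (Drop the target column from
>         train_df first, or it shows up as an all-zero column in test.)"""
>         train_enc = pd.get_dummies(train_df, columns=categorical_columns)
>         test_enc = pd.get_dummies(test_df, columns=categorical_columns)
>         test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
>         return train_enc, test_enc
>
> Both analysis.py and the generated pipeline could then import that one
> function instead of duplicating the get_dummies call.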
>
> ... TODO: today
>
> - [ ] pd.get_dummies(train_df), pd.get_dummies(test_df)
> - [ ] Dockerfile
>   - probably the easiest way to reproduce the environment.yml
> - [ ] automate the __modified.py patching process (a rough sketch follows
>   the commands below):
>
>
>     # git clone ssh://git@github.com/westurner/house_prices  # -b develop
>     conda env update -f ./environment.yml
>     cd house_prices/
>     python ./analysis.py
>     # (wait)
>     mv ./pipelines/tpot_house_prices_.py \
>        ./pipelines/tpot_house_prices__002.py
>     mv ./pipelines/tpot_house_prices_.py.json \
>        ./pipelines/tpot_house_prices__002.py.json
>     cp ./pipelines/tpot_house_prices__001__modified.py \
>        ./pipelines/tpot_house_prices__002__modified.py
>     # copy/paste (TODO: patch/template):
>     #   - exported_pipeline / self.exported_pipeline
>     #   - the sklearn imports to __002__modified.py
>     cd pipelines/  # TODO: settings.json
>     python ./tpot_house_prices__002__modified.py
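>
> A rough sketch of the patching automation (the script name is made up, and
> the regexes assume TPOT's usual exported layout): pull the sklearn imports
> and the exported_pipeline assignment out of the freshly generated file so
> they can be pasted, or templated, into the next __modified.py:
>
>     import re
>     import sys
>
>     # usage: python extract_pipeline.py ./pipelines/tpot_house_prices__002.py
>     src = open(sys.argv[1]).read()
>
>     # the sklearn imports from the generated pipeline
>     for line in re.findall(r'^from sklearn\..*$', src, re.MULTILINE):
>         print(line)
>
>     # everything from "exported_pipeline = ..." up to the .fit() call
>     match = re.search(r'^exported_pipeline = .*?(?=^exported_pipeline\.fit)',
>                       src, re.MULTILINE | re.DOTALL)
>     if match:
>         print(match.group(0))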
>
>
> ... The modified pipeline generation is not quite reproducible yet, but
> the generated pipeline (tpot_house_prices__001[__modified].py) is. (With
> ~2% error ... only about $6 million off :|)
>
>
>
>> Sent from my iPhone
>>
>> On Dec 18, 2016, at 10:59 AM, Wes Turner <wes.turner at gmail.com> wrote:
>>
>> Thanks, Bob!
>>
>> On Sun, Dec 18, 2016 at 9:26 AM, Bob Haffner <bob.haffner at gmail.com>
>> wrote:
>>
>>> Nice job, Wes!!
>>>
>>> On Sun, Dec 18, 2016 at 4:11 AM, Wes Turner <wes.turner at gmail.com>
>>> wrote:
>>>
>>>> In addition to posting to the mailing list, I created a comment on the
>>>> "Kaggle Submissions" issue [1]:
>>>>
>>>> - Score: 0.13667 (#1370)
>>>>>   - https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3925119
>>>>>   - https://mail.python.org/pipermail/omaha/2016-December/002206.html
>>>>>   - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py
>>>>
>>>>
>>>> [1] https://github.com/omahapython/kaggle-houseprices/issues/2
>>>>
>>>> On Sun, Dec 18, 2016 at 3:45 AM, Wes Turner <wes.turner at gmail.com>
>>>> wrote:
>>>>
>>>>> Sounds great. 1/18.
>>>>>
>>>>> I just submitted my first submission.csv to Kaggle! [1]
>>>>>
>>>>> $ python ./tpot_house_prices__001__modified.py
>>>>> class_sum: 264144946
>>>>> abs error: 5582809.288
>>>>> % error:   2.11354007432 %
>>>>> error**2:  252508654837.0
>>>>> #  python ./tpot_house_prices__001__modified.py
>>>>>
>>>>>
>>>>> ... Which moves us up to #1370!
>>>>>
>>>>> Your Best Entry ↑
>>>>> You improved on your best score by 0.02469.
>>>>> You just moved up 608 positions on the leaderboard.
>>>>>
>>>>>
>>>>> I have a few more things to try:
>>>>>
>>>>>
>>>>>    - Manually drop the 'Id' column
>>>>>    - do_get_dummies=True (data.py) + EC2 m4.4xlarge instance
>>>>>       - I got an OOM error w/ an 8GB notebook (at 25/120 w/ verbosity=2)
>>>>>       - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L94
>>>>>       - sklearn GridSearchCV and/or sklearn-deap for the TPOT
>>>>>       hyperparameters (rough sketch below)
>>>>>       - http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>>>>>       - https://github.com/rsteca/sklearn-deap
>>>>>    - REF,BLD,DOC,TST:
>>>>>       - factor constants out in favor of settings.json and data.py
>>>>>          - https://github.com/omahapython/kaggle-houseprices/blob/master/src/data.py
>>>>>       - implement train.py and predict.py, too
>>>>>       - create a Dockerfile FROM kaggle/docker-python:latest
>>>>>          - https://github.com/omahapython/datascience/issues/3
>>>>>          "Kaggle Best Practices"
>>>>>       - docstrings, tests
>>>>>    - https://github.com/omahapython/datascience/wiki/resources
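>>>>>
>>>>> For the GridSearchCV item, I'm thinking of something minimal like this
>>>>> (the estimator and grid are placeholders, not what TPOT picked; X and y
>>>>> would come from data.py):
>>>>>
>>>>>     from sklearn.ensemble import GradientBoostingRegressor
>>>>>     from sklearn.model_selection import GridSearchCV
>>>>>
>>>>>     param_grid = {'n_estimators': [100, 300],
>>>>>                   'max_depth': [2, 3, 4],
>>>>>                   'learning_rate': [0.05, 0.1]}
>>>>>     search = GridSearchCV(GradientBoostingRegressor(), param_grid,
>>>>>                           scoring='neg_mean_squared_error', cv=5)
>>>>>     search.fit(X, y)  # X: features, y: SalePrice
>>>>>     print(search.best_params_, search.best_score_)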
>>>>>
>>>>> [1] https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py
>>>>>
>>>>> On Sat, Dec 17, 2016 at 4:39 PM, Bob Haffner via Omaha <
>>>>> omaha at python.org> wrote:
>>>>>
>>>>>> Hey all, regarding our January kaggle meetup that we talked about.  Maybe
>>>>>> we can meet following our regular monthly (1/18).
>>>>>>
>>>>>> Would that be easier/better for everyone?
>>>>>>
>>>>>> On Sat, Dec 17, 2016 at 4:34 PM, Bob Haffner <bob.haffner at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> > Just submitted another Linear Regression attempt (0.16136).  Added some
>>>>>> > features, both numeric and categorical, and created 3 numerics
>>>>>> >
>>>>>> > -TotalFullBaths
>>>>>> > -TotalHalfBaths
>>>>>> > -Pool
>>>>>> >
>>>>>> > Notebook attached
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Sat, Dec 17, 2016 at 4:21 PM, Bob Haffner <bob.haffner at gmail.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> >> Just submitted another Linear Regression attempt (0.16136).  Added some
>>>>>> >> features, both numeric and categorical, and created 3 numerics
>>>>>> >>
>>>>>> >> -TotalFullBaths
>>>>>> >> -TotalHalfBaths
>>>>>> >> -Pool
>>>>>> >>
>>>>>> >> Notebook attached
>>>>>> >>
>>>>>> >> On Sat, Dec 17, 2016 at 3:28 PM, Wes Turner <wes.turner at gmail.com>
>>>>>> >> wrote:
>>>>>> >>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Sat, Dec 17, 2016 at 3:25 PM, Wes Turner <wes.turner at gmail.com>
>>>>>> >>> wrote:
>>>>>> >>>
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> On Sat, Dec 17, 2016 at 2:39 PM, Bob Haffner via Omaha <
>>>>>> >>>> omaha at python.org> wrote:
>>>>>> >>>>
>>>>>> >>>>> >Does Kaggle take the high mark but still give a score for each
>>>>>> >>>>> submission?
>>>>>> >>>>> Yes.
>>>>>> >>>>> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/submissions
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>> >Thinking of ways to keep track of which code produced which score;
>>>>>> >>>>> >I'll post about the GitHub setup in a bit.
>>>>>> >>>>> We could push our notebooks to the github repo?  Maybe include a
>>>>>> >>>>> brief description at the top in a markdown cell
>>>>>> >>>>>
>>>>>> >>>>
>>>>>> >>>> In my research [1], I found that the preferred folder structure for
>>>>>> >>>> kaggle is input/ (data), src/ (.py, .ipynb notebooks), and working/
>>>>>> >>>> (outputs);
>>>>>> >>>> and that they recommend creating a settings.json with path
>>>>>> >>>> configuration (e.g. pointing to input/, src/, data/)
>>>>>> >>>>
>>>>>> >>>> So, we could put notebooks, folders, and repos in src/ [2].
>>>>>> >>>>
>>>>>> >>>> runipy is a bit more scriptable than requiring notebook gui
>>>>>> >>>> interactions [3].
>>>>>> >>>>
>>>>>> >>>> We could either hardcode '../input/test.csv' in our .py and .ipynb
>>>>>> >>>> sources, or we could write a function in src/data.py to read
>>>>>> >>>> '../settings.json' into a dict with the recommended variable
>>>>>> >>>> names [1]:
>>>>>> >>>>
>>>>>> >>>>     import pandas as pd
>>>>>> >>>>     from data import read_settings_json
>>>>>> >>>>     settings = read_settings_json()
>>>>>> >>>>     train = pd.read_csv(settings['TRAIN_DATA_PATH'])
>>>>>> >>>>     # ... fit, predict, build a submission DataFrame ...
>>>>>> >>>>     submission.to_csv(settings['SUBMISSION_PATH'], index=False)
>>>>>> >>>>
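>>>>>> >>>> read_settings_json itself could be as small as this sketch (the key
>>>>>> >>>> names are just whatever we agree to put in settings.json, e.g. the
>>>>>> >>>> TRAIN_DATA_PATH / SUBMISSION_PATH used above):
>>>>>> >>>>
>>>>>> >>>>     import json
>>>>>> >>>>
>>>>>> >>>>     def read_settings_json(path='../settings.json'):
>>>>>> >>>>         """Return settings.json as a dict of paths, e.g.
>>>>>> >>>>         {'TRAIN_DATA_PATH': '../input/train.csv', ...}"""
>>>>>> >>>>         with open(path) as f:
>>>>>> >>>>             return json.load(f)
>>>>>> >>>>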
>>>>>> >>>> [1] https://github.com/omahapython/datascience/issues/3#issuecomment-267236556
>>>>>> >>>> [2] https://github.com/omahapython/kaggle-houseprices/tree/master/src
>>>>>> >>>> [3] https://pypi.python.org/pypi/runipy
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>>>
>>>>>> >>>>> I initially thought github was a good way to go, but I don't know if
>>>>>> >>>>> everyone has a github acct or is interested in starting one.  Maybe
>>>>>> >>>>> email is the way to go?
>>>>>> >>>>>
>>>>>> >>>>
>>>>>> >>>> I'm all for GitHub:
>>>>>> >>>>
>>>>>> >>>> - git source control and revision numbers
>>>>>> >>>> - we're not able to easily share code in the mailing list
>>>>>> >>>> - we can learn from each others' solutions
>>>>>> >>>>
>>>>>> >>>
>>>>>> >>> An example of mailing list limitations:
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Your mail to 'Omaha' with the subject
>>>>>> >>>
>>>>>> >>>     Re: [omaha] Group Data Science Competition
>>>>>> >>>
>>>>>> >>> Is being held until the list moderator can review it for approval.
>>>>>> >>>
>>>>>> >>> The reason it is being held:
>>>>>> >>>
>>>>>> >>>     Message body is too big: 47004 bytes with a limit of 40 KB
>>>>>> >>>
>>>>>> >>>  (I trimmed out the reply chain; so this may make it through first)
>>>>>> >>>
>>>>>> >>
>>>>>> >>
>>>>>> >
>>>>>> _______________________________________________
>>>>>> Omaha Python Users Group mailing list
>>>>>> Omaha at python.org
>>>>>> https://mail.python.org/mailman/listinfo/omaha
>>>>>> http://www.OmahaPython.org
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


More information about the Omaha mailing list