[omaha] Group Data Science Competition

Bob Haffner bob.haffner at gmail.com
Sat Dec 17 17:34:04 EST 2016


Just submitted another Linear Regression attempt (score: 0.16136).  Added some
features, both numeric and categorical, and created 3 new numeric features:

-TotalFullBaths
-TotalHalfBaths
-Pool

Notebook attached
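
The notebook was scrubbed from the archive, so the exact definitions of the
three created features are not visible here. A minimal sketch of one plausible
reading, assuming the competition's standard columns (FullBath, BsmtFullBath,
HalfBath, BsmtHalfBath, PoolArea) and treating missing values as 0. The
function and helper names are hypothetical, and it works on plain dict rows
(as csv.DictReader yields) rather than a DataFrame:

```python
def to_number(value):
    """Convert a CSV cell to float, treating blanks and 'NA' as 0."""
    if value is None or str(value).strip() in ("", "NA"):
        return 0.0
    return float(value)


def add_bath_and_pool_features(row):
    """Return a copy of one house record with three derived numeric features.

    Assumed definitions (not confirmed by the notebook):
      TotalFullBaths = FullBath + BsmtFullBath
      TotalHalfBaths = HalfBath + BsmtHalfBath
      Pool           = 1 if PoolArea > 0 else 0
    """
    out = dict(row)  # leave the caller's row untouched
    out["TotalFullBaths"] = int(
        to_number(row.get("FullBath")) + to_number(row.get("BsmtFullBath"))
    )
    out["TotalHalfBaths"] = int(
        to_number(row.get("HalfBath")) + to_number(row.get("BsmtHalfBath"))
    )
    out["Pool"] = 1 if to_number(row.get("PoolArea")) > 0 else 0
    return out


if __name__ == "__main__":
    example = {
        "FullBath": "2",
        "BsmtFullBath": "1",
        "HalfBath": "1",
        "BsmtHalfBath": "NA",
        "PoolArea": "0",
    }
    print(add_bath_and_pool_features(example))
```

With pandas the same idea would be column arithmetic on the train and test
frames; the dict version just keeps the sketch dependency-free.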



On Sat, Dec 17, 2016 at 4:21 PM, Bob Haffner <bob.haffner at gmail.com> wrote:

> On Sat, Dec 17, 2016 at 3:28 PM, Wes Turner <wes.turner at gmail.com> wrote:
>
>>
>>
>> On Sat, Dec 17, 2016 at 3:25 PM, Wes Turner <wes.turner at gmail.com> wrote:
>>
>>>
>>>
>>> On Sat, Dec 17, 2016 at 2:39 PM, Bob Haffner via Omaha <omaha at python.org> wrote:
>>>
>>>> >Does Kaggle take the high mark but still give a score for each
>>>> submission?
>>>> Yes.
>>>> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/submissions
>>>>
>>>>
>>>> >Thinking of ways to keep track of which code produced which score; I'll
>>>> >post about the GitHub setup in a bit.
>>>> We could push our notebooks to the GitHub repo, maybe with a brief
>>>> description at the top in a markdown cell?
>>>>
>>>
>>> In my research [1], I found that the preferred folder structure for
>>> Kaggle is input/ (data), src/ (.py scripts, .ipynb notebooks), and
>>> working/ (outputs), and that they recommend creating a settings.json
>>> with path configuration (e.g. pointing to input/, src/ data/).
>>>
>>> So, we could put notebooks, folders, and repos in src/ [2].
>>>
>>> runipy [3] runs notebooks from the command line, which is a bit more
>>> scriptable than requiring notebook GUI interactions.
>>>
>>> We could either hardcode '../input/test.csv' in our .py and .ipynb
>>> sources, or we could write a function in src/data.py to read
>>> '../settings.json' into a dict with the recommended variable names [1]:
>>>
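>>>     import pandas as pd
>>>     from data import read_settings_json
>>>     settings = read_settings_json()
>>>     train = pd.read_csv(settings['TRAIN_DATA_PATH'])
>>>     # ....
>>>     # pandas has no pd.write_csv; call to_csv on the DataFrame to save.
>>>     # `submission` stands for the predictions DataFrame built above.
>>>     submission.to_csv(settings['SUBMISSION_PATH'], index=False)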
>>>
>>> [1] https://github.com/omahapython/datascience/issues/3#issuecomment-267236556
>>> [2] https://github.com/omahapython/kaggle-houseprices/tree/master/src
>>> [3] https://pypi.python.org/pypi/runipy
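
The read_settings_json helper proposed above is not shown in the thread. A
minimal sketch of what src/data.py could contain, assuming settings.json is a
flat JSON object of path strings and that the default location is
'../settings.json' relative to where the script or notebook runs (the default
argument and the error handling are assumptions, not something the thread
specifies):

```python
import json
from pathlib import Path


def read_settings_json(path="../settings.json"):
    """Load path configuration from a JSON file and return it as a dict.

    Assumes the file holds a single JSON object, for example
    {"TRAIN_DATA_PATH": "../input/train.csv", "SUBMISSION_PATH": "..."}.
    """
    settings_path = Path(path)
    with settings_path.open(encoding="utf-8") as fh:
        settings = json.load(fh)
    if not isinstance(settings, dict):
        # Fail early if the file is valid JSON but not an object.
        raise ValueError(f"{settings_path} must contain a JSON object")
    return settings
```

Keeping the paths in one file means a notebook moved between machines only
needs settings.json edited, not every read_csv call.
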
>>>
>>>
>>>>
>>>> I initially thought GitHub was a good way to go, but I don't know if
>>>> everyone has a GitHub account or is interested in starting one.  Maybe
>>>> email is the way to go?
>>>>
>>>
>>> I'm all for GitHub:
>>>
>>> - git source control and revision numbers
>>> - we're not able to easily share code in the mailing list
>>> - we can learn from each other's solutions
>>>
>>
>> An example of mailing list limitations:
>>
>>
>> Your mail to 'Omaha' with the subject
>>
>>     Re: [omaha] Group Data Science Competition
>>
>> Is being held until the list moderator can review it for approval.
>>
>> The reason it is being held:
>>
>>     Message body is too big: 47004 bytes with a limit of 40 KB
>>
>>  (I trimmed out the reply chain, so this may make it through first.)
>>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: kaggle_house_prices.ipynb
Type: application/octet-stream
Size: 29833 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/omaha/attachments/20161217/75aef747/attachment-0001.obj>

