From bob.haffner at gmail.com Thu Dec 1 09:32:49 2016 From: bob.haffner at gmail.com (Bob Haffner) Date: Thu, 1 Dec 2016 08:32:49 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: Hi All, We're all set for the 12/14 group Kaggle competition kickoff! All experience levels are welcome. Bring your laptop if you'd like, but no biggie if you don't I didn't hear any objections to the Housing Prices competition so let's go with that one https://www.kaggle.com/c/house-prices-advanced-regression-techniques Suggested things to do prior to 12/14 -- Sign up on Kaggle -- Get your machine set up with some pydata libraries (Pandas, Numpy, SciKit-Learn and Jupyter Notebooks). I recommend the Anaconda distribution if you're just starting out -- Get some basic familiarity with the competition problem and data Let me know if you have any questions. Thanks! Bob On Tue, Oct 18, 2016 at 8:32 PM, Bob Haffner wrote: > Good deal. That's 3 of us (Naomi, you and me) by my count. Hopefully > others will join in!! > > I would be game for a December meetup. > > Sent from my iPhone > > > On Oct 18, 2016, at 8:13 PM, Steve Young via Omaha > wrote: > > > > I would enjoy participating, and learning what you data guys and gals do. > > (I am not a math guy) > > > > If Hubert does not take December, maybe we could have a sprint that > night? > > > > Steve > > > > On Mon, Oct 17, 2016 at 3:05 PM, Wes Turner via Omaha > > wrote: > > > >> On Monday, October 17, 2016, Bob Haffner via Omaha > >> wrote: > >> > >>> Hi All, > >>> > >>> A few months ago someone brought up the idea of doing a Kaggle data > >> science > >>> competition as a group. Is there still interest in this? > >>> > >>> Some thoughts. > >>> Not sure of the details, but Kaggle allows individuals to form groups. > >> We > >>> could collaborate thru email (or perhaps something like Slack) and > maybe > >>> meet occasionally. When it's all said and done, we could present at a > >>> monthly meeting. > >> > >> > >> A GitHub (repo, issues, and sphinx docs/ and/or GH wiki) could also be > >> useful: > >> > >> - gh-pages branch built from docs/ and nb/ > >> - .ipynb in notebooks/ or nb/ > >> - https://github.com/audreyr/cookiecutter-pypackage/ has packaging and > >> ReadTheDocs config > >> - > >> https://github.com/jupyter/docker-stacks/blob/master/ > >> scipy-notebook/Dockerfile > >> includes conda > >> > >> > >> > >>> > >>> This one looks good. Doesn't end till March 1st which gives us some > time > >>> and it doesn't look overly complicated. 
No prize money, though :-) > >>> https://www.kaggle.com/c/house-prices-advanced-regression-techniques > >> > >> > >> - http://rhiever.github.io/tpot/examples/Boston_Example/ > >> > >> - TPOT can utilize XGBoost (as mentioned in the Kaggle competition > >> description) > >> > >> > >> > >> - https://github.com/donnemartin/data-science-ipython-notebooks/ > >> > >> > >>> Forming groups > >>> https://www.kaggle.com/wiki/FormingATeam > >>> > >>> Would love to get some feedback on any of this > >>> > >>> Thanks, > >>> Bob > >>> _______________________________________________ > >>> Omaha Python Users Group mailing list > >>> Omaha at python.org > >>> https://mail.python.org/mailman/listinfo/omaha > >>> http://www.OmahaPython.org > >>> > >> _______________________________________________ > >> Omaha Python Users Group mailing list > >> Omaha at python.org > >> https://mail.python.org/mailman/listinfo/omaha > >> http://www.OmahaPython.org > >> > > _______________________________________________ > > Omaha Python Users Group mailing list > > Omaha at python.org > > https://mail.python.org/mailman/listinfo/omaha > > http://www.OmahaPython.org > From ted.warren at gmail.com Fri Dec 2 10:04:15 2016 From: ted.warren at gmail.com (Ted Warren) Date: Fri, 2 Dec 2016 09:04:15 -0600 Subject: [omaha] help getting started Message-ID: Hello, My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton University. I am a synaptic physiologist who studies neuronal circuits involved in epilepsy. I am interested in starting to write code using Python, but I need some help. I have been going through the following text to learn how to use Python within the context of my field: A primer on scientific programming with python, 3rd Ed. by Hans Petter Langtangen. I have been unable to download the IDLE on my computer with Windows 10. I have seen on the web that there is a bug and have been unable to find anyone on the web who has been able to circumvent the problem for my computer. I was wondering if there is anyone here that could help me. I cannot get off the ground learning until I get the IDE up and running. Just FYI, python is popular in neurophysiology for analyzing and modeling neural circuits ( e.g., these two neurons signal via a capacitative coupling mechanism ). I am just trying to catch up with some of my colleagues. If I need to go somewhere else to get an answer for my question, any suggestions for directions would be appreciated. Thank you ahead of time, Ted From ted.warren at gmail.com Fri Dec 2 10:22:47 2016 From: ted.warren at gmail.com (Ted Warren) Date: Fri, 2 Dec 2016 09:22:47 -0600 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: Hi Burch, So, I will detail two problems. First, the book had me create a folder and download a python file. The file contains a simple physics equation for calculating how high a ball is if thrown up straight into the air with respect to time. If I click on the file to open it up, a brief command window opens up ( for about half a second ) and then closes immediately. Second, if I try to start idle from the command prompt, it either does the same thing as above, or I get an error message that reads, "Windows cannot find 'idle'. Make sure you typed the name correctly, and then try again." Peace, Ted On Fri, Dec 2, 2016 at 9:15 AM, Burch Kealey wrote: > Ted > > > Can you be more specific. You state "I have been unable to download IDLE > on my computer." > > > That is pretty wide open.
> > > With regards > > > Burch > ------------------------------ > *From:* Omaha on > behalf of Ted Warren via Omaha > *Sent:* Friday, December 2, 2016 9:04:15 AM > *To:* omaha at python.org > *Cc:* Ted Warren > *Subject:* [omaha] help getting started > > Hello, > > My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton > University. I am a synaptic physiologist who studies neuronal circuits > involved in epilepsy. I am interested in started to write code using > Python, but I need some help. I have been going through the following text > to learn how to use Python within the context of my field: A primer on > scientific programming with python, 3rd Ed. by Hans Petter Langtangen. > > I have been unable to download the IDLE on my computer with Windows 10. I > have seen on the web that there is a bug and have been unable to find > anyone on the web who has been able to circumvent the problem for my > computer. > > I was wondering if there is anyone here that could help me. I cannot get > off the ground learning until I get the IDE up and running. > > Just FYI, python is popular in neurophysiology for analyzing and modeling > neural circuits ( e.g., these two neurons signal via a capacitative > coupling mechanism ). I am just trying to catch up with some of colleagues. > > If I need to go somewhere else to get an answer for my question, any > suggestions for directions would be appreciated. > > Thank you ahead of time, > > Ted > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > From cfarrow at enthought.com Fri Dec 2 10:16:59 2016 From: cfarrow at enthought.com (Chris Farrow) Date: Fri, 2 Dec 2016 09:16:59 -0600 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: Hi Ted, IDLE is pretty outdated these days. I recommend starting with Enthought Canopy, which will keep the complications of package installation and management in the background until you are ready to learn them. The entirety of Canopy is free for students, as is a subset of Enthought's online training. Disclaimer, I work for Enthought (but those things are free). :) Regards, Chris Farrow On Fri, Dec 2, 2016 at 9:04 AM, Ted Warren via Omaha wrote: > Hello, > > My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton > University. I am a synaptic physiologist who studies neuronal circuits > involved in epilepsy. I am interested in started to write code using > Python, but I need some help. I have been going through the following text > to learn how to use Python within the context of my field: A primer on > scientific programming with python, 3rd Ed. by Hans Petter Langtangen. > > I have been unable to download the IDLE on my computer with Windows 10. I > have seen on the web that there is a bug and have been unable to find > anyone on the web who has been able to circumvent the problem for my > computer. > > I was wondering if there is anyone here that could help me. I cannot get > off the ground learning until I get the IDE up and running. > > Just FYI, python is popular in neurophysiology for analyzing and modeling > neural circuits ( e.g., these two neurons signal via a capacitative > coupling mechanism ). I am just trying to catch up with some of colleagues. > > If I need to go somewhere else to get an answer for my question, any > suggestions for directions would be appreciated.
> > Thank you ahead of time, > > Ted > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > From aaron at keck.io Fri Dec 2 10:06:20 2016 From: aaron at keck.io (Aaron Keck) Date: Fri, 02 Dec 2016 15:06:20 +0000 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: Welcome Warren! I run Python on Windows 10 and have been successful, maybe I can help? What bug are you running into with IDLE? On Fri, Dec 2, 2016 at 9:04 AM Ted Warren via Omaha wrote: > Hello, > > My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton > University. I am a synaptic physiologist who studies neuronal circuits > involved in epilepsy. I am interested in started to write code using > Python, but I need some help. I have been going through the following text > to learn how to use Python within the context of my field: A primer on > scientific programming with python, 3rd Ed. by Hans Petter Langtangen. > > I have been unable to download the IDLE on my computer with Windows 10. I > have seen on the web that there is a bug and have been unable to find > anyone on the web who has been able to circumvent the problem for my > computer. > > I was wondering if there is anyone here that could help me. I cannot get > off the ground learning until I get the IDE up and running. > > Just FYI, python is popular in neurophysiology for analyzing and modeling > neural circuits ( e.g., these two neurons signal via a capacitative > coupling mechanism ). I am just trying to catch up with some of colleagues. > > If I need to go somewhere else to get an answer for my question, any > suggestions for directions would be appreciated. > > Thank you ahead of time, > > Ted > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > From wereapwhatwesow at gmail.com Fri Dec 2 10:32:53 2016 From: wereapwhatwesow at gmail.com (Steve Young) Date: Fri, 2 Dec 2016 09:32:53 -0600 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: Hello Ted, I am a Windows user who has been using Python for many years. In the start, I was running Python on Windows, as I did not want to deal with setting up a 2nd computer running some Linux OS, and have to learn that OS as well as Python. I fought with issues (like your IDLE bug) for a couple of years. And the further I dove into Python, it seemed the issues increased. Each one dampening my enthusiasm for programming. I noticed most other Python programmers were using Linux or a Mac, and the issues frustrating me were due to Windows. So, a few years ago, with the maturing of Virtual Machine technology (I use the free VMWare player, but Virtualbox is popular as well, and I think Windows may have VM tech built in that will run Linux - Hyper-V), and the advancement of Ubuntu, and other lightweight Linux distros that make the transition between Windows and Linux much easier, I decided to create a Ubuntu VM to run on my Windows machine. I have not looked back. I still use Windows as my main OS, and only run the VM when programming. It took me a day or so to set up the VM, set up Python on it, got networking tweaked, and learn some linux command line syntax. After that, no more strange bugs with Windows that 99% of the Python community is not interested in. 
And instead of dealing with obscure Windows issues, I was learning about technology that increases my skill set. If you think you will be using Python for some time into the future, I would heartily recommend taking a bit of time to set up a computer with Ubuntu (this one was easy for me to transition to from Windows, but there are others), or create a VM on your Windows machine (My VM runs a tad slow sometimes, but I prefer that to having to lug 2 laptops around with me). Welcome to Python!! Steve On Fri, Dec 2, 2016 at 9:04 AM, Ted Warren via Omaha wrote: > Hello, > > My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton > University. I am a synaptic physiologist who studies neuronal circuits > involved in epilepsy. I am interested in started to write code using > Python, but I need some help. I have been going through the following text > to learn how to use Python within the context of my field: A primer on > scientific programming with python, 3rd Ed. by Hans Petter Langtangen. > > I have been unable to download the IDLE on my computer with Windows 10. I > have seen on the web that there is a bug and have been unable to find > anyone on the web who has been able to circumvent the problem for my > computer. > > I was wondering if there is anyone here that could help me. I cannot get > off the ground learning until I get the IDE up and running. > > Just FYI, python is popular in neurophysiology for analyzing and modeling > neural circuits ( e.g., these two neurons signal via a capacitative > coupling mechanism ). I am just trying to catch up with some of colleagues. > > If I need to go somewhere else to get an answer for my question, any > suggestions for directions would be appreciated. > > Thank you ahead of time, > > Ted > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > From bkealey at unomaha.edu Fri Dec 2 10:26:41 2016 From: bkealey at unomaha.edu (Burch Kealey) Date: Fri, 2 Dec 2016 15:26:41 +0000 Subject: [omaha] help getting started In-Reply-To: References: , Message-ID: So you have not installed Python on your computer? ________________________________ From: Ted Warren Sent: Friday, December 2, 2016 9:22:47 AM To: Burch Kealey Cc: Omaha Python Users Group Subject: Re: [omaha] help getting started Hi Burch, So, I will detail two problems. First, the book had me create a folder and download a python file. The file contains a simple physics equation for calculating how high a ball is if thrown up straight into the air with respect to time. If I click on the file to open it up, a brief command window opens up ( for about half a second ) and then closes immediately. Second, if I try to start idle from the command prompt, it either does the same thing as above, or I get an error message that reads, "Windows cannot find 'idle'. Make sure you typed the name correctly, and then try again. Peace, Ted On Fri, Dec 2, 2016 at 9:15 AM, Burch Kealey > wrote: Ted Can you be more specific. You state "I have been unable to download IDLE on my computer." That is pretty wide open. With regards Burch ________________________________ From: Omaha > on behalf of Ted Warren via Omaha > Sent: Friday, December 2, 2016 9:04:15 AM To: omaha at python.org Cc: Ted Warren Subject: [omaha] help getting started Hello, My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton University. 
I am a synaptic physiologist who studies neuronal circuits involved in epilepsy. I am interested in started to write code using Python, but I need some help. I have been going through the following text to learn how to use Python within the context of my field: A primer on scientific programming with python, 3rd Ed. by Hans Petter Langtangen. I have been unable to download the IDLE on my computer with Windows 10. I have seen on the web that there is a bug and have been unable to find anyone on the web who has been able to circumvent the problem for my computer. I was wondering if there is anyone here that could help me. I cannot get off the ground learning until I get the IDE up and running. Just FYI, python is popular in neurophysiology for analyzing and modeling neural circuits ( e.g., these two neurons signal via a capacitative coupling mechanism ). I am just trying to catch up with some of colleagues. If I need to go somewhere else to get an answer for my question, any suggestions for directions would be appreciated. Thank you ahead of time, Ted _______________________________________________ Omaha Python Users Group mailing list Omaha at python.org https://mail.python.org/mailman/listinfo/omaha http://www.OmahaPython.org From wes.turner at gmail.com Fri Dec 2 11:26:08 2016 From: wes.turner at gmail.com (Wes Turner) Date: Fri, 2 Dec 2016 10:26:08 -0600 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: On Friday, December 2, 2016, Ted Warren via Omaha wrote: > Hi Burch, > > So, I will detail two problems. > > First, the book had me create a folder and download a python file. The file > contains a simple physics equation for calculating how high a ball is if > thrown up straight into the air with respect to time. If I click on the > file to open it up, a brief command window opens up ( for about half a > second ) and then closes immediately. File type associations. What you want is for the .py file to open in a text editor or IDE; you'll run most Python scripts from a shell (CMD.exe, IPython, IDE Run dialog). https://www.google.com/search?q=anaconda+file+type+associations+windows Anaconda is free. After you install Anaconda (which includes a number of conda packages (libraries, modules, folders of .py, .pyc, .pyo, .so, .dll files)), you can run 'conda install spyder' for a decent open source IDE. > > Second, if I try to start idle from the command prompt, it either does the > same thing as above, or I get an error message that reads, "Windows cannot > find 'idle'. Make sure you typed the name correctly, and then try again. cmd.exe > `echo %PATH%` https://www.google.com/search?q=windows+idle+path http://superuser.com/questions/234126/how-can-i-open-python-files-in-idle-from-windows > > Peace, Peace > > Ted > > On Fri, Dec 2, 2016 at 9:15 AM, Burch Kealey > wrote: > > > Ted > > > > > > Can you be more specific. You state "I have been unable to download IDLE > > on my computer." > > > > > > That is pretty wide open. > > > > > > With regards > > > > > > Burch > > ------------------------------ > > *From:* Omaha > on > > behalf of Ted Warren via Omaha > > > *Sent:* Friday, December 2, 2016 9:04:15 AM > > *To:* omaha at python.org > > *Cc:* Ted Warren > > *Subject:* [omaha] help getting started > > > > Hello, > > > > My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton > > University. I am a synaptic physiologist who studies neuronal circuits > > involved in epilepsy. I am interested in started to write code using > > Python, but I need some help.
I have been going through the following > text > > to learn how to use Python within the context of my field: A primer on > > scientific programming with python, 3rd Ed. by Hans Petter Langtangen. > > > > I have been unable to download the IDLE on my computer with Windows 10. I > > have seen on the web that there is a bug and have been unable to find > > anyone on the web who has been able to circumvent the problem for my > > computer. > > > > I was wondering if there is anyone here that could help me. I cannot get > > off the ground learning until I get the IDE up and running. > > > > Just FYI, python is popular in neurophysiology for analyzing and modeling > > neural circuits ( e.g., these two neurons signal via a capacitative > > coupling mechanism ). I am just trying to catch up with some of > colleagues. > > > > If I need to go somewhere else to get an answer for my question, any > > suggestions for directions would be appreciated. > > > > Thank you ahead of time, > > > > Ted > > _______________________________________________ > > Omaha Python Users Group mailing list > > Omaha at python.org > > https://mail.python.org/mailman/listinfo/omaha > > http://www.OmahaPython.org > > > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > From wes.turner at gmail.com Fri Dec 2 11:44:43 2016 From: wes.turner at gmail.com (Wes Turner) Date: Fri, 2 Dec 2016 10:44:43 -0600 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: On Friday, December 2, 2016, Ted Warren via Omaha wrote: > Hello, > > My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton > University. I am a synaptic physiologist who studies neuronal circuits > involved in epilepsy. I am interested in started to write code using > Python, but I need some help. I have been going through the following text > to learn how to use Python within the context of my field: A primer on > scientific programming with python, 3rd Ed. by Hans Petter Langtangen. There are lots of great Python books. - https://westurner.org/tools/#python - https://westurner.org/tools/#scipy - Scipy Lectures https://scipy-lectures.github.io/ - Lectures on scientific computing with Python https://github.com/jrjohansson/scientific-python-lectures/blob/master/READ - https://learnxinyminutes.com/docs/python/ - https://learnxinyminutes.com/docs/python3/ > I have been unable to download the IDLE on my computer with Windows 10. I > have seen on the web that there is a bug and have been unable to find > anyone on the web who has been able to circumvent the problem for my > computer. > > I was wondering if there is anyone here that could help me. I cannot get > off the ground learning until I get the IDE up and running. Anaconda installs a python (which includes idle, IIRC; there are much better IDEs for debugging: https://westurner.org/wiki/awesome-python-testing#ides ) https://docs.continuum.io/anaconda/install > > Just FYI, python is popular in neurophysiology for analyzing and modeling > neural circuits ( e.g., these two neurons signal via a capacitative > coupling mechanism ). I am just trying to catch up with some of colleagues. http://martinos.org/mne/stable/ http://irakorshunova.github.io/2014/11/27/seizures.html https://conda-forge.github.io ... 
Tensors (Theano, TensorFlow (phone, workstation, cluster)) https://github.com/rhiever/tpot > > If I need to go somewhere else to get an answer for my question, any > suggestions for directions would be appreciated. Are you looking to hire a Python dev? - https://westurner.org/resume/html/resume.html TDD: Test-Driven-Development - https://westurner.org/2016/10/17/teaching-test-driven-development-first - https://westurner.org/2016/10/18/criteria-for-success-and-test-driven-development - https://wrdrd.com/docs/consulting/software-development#test-driven-development > Thank you ahead of time, > > Ted > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > From wes.turner at gmail.com Fri Dec 2 12:25:49 2016 From: wes.turner at gmail.com (Wes Turner) Date: Fri, 2 Dec 2016 11:25:49 -0600 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: On Friday, December 2, 2016, Ted Warren wrote: > "Are you looking to hire a Python dev?" > > I am a government researcher. I do not have the funds for this. But I > really do appreciate all the resources. > http://www.scholarpedia.org/article/Recurrent_neural_networks "Broadcasting in Theano vs. Numpy" http://deeplearning.net/software/theano/library/tensor/basic.html#libdoc-tensor-broadcastable - https://docs.docker.com/docker-for-windows/ - https://hub.docker.com/r/kaggle/python/ - https://github.com/jupyter/docker-stacks/ (these include conda (!conda install, conda environment.yml is like pip requirements.txt)) ... Reproducibility - "Ten Simple Rules for Reproducible Computational Research" http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285 DOI: 10.1371/journal.pcbi.1003285 - https://wrdrd.com/docs/consulting/education-technology#jupyter-and-reproducibility - https://wrdrd.com/docs/consulting/research#zotero - TDD! > > Thanks, > Well, maybe you should/could go talk to NSF and NIH NLM. https://wrdrd.com/docs/consulting/data-science#linked-reproducibility #LinkedReproducibility > > Ted > > On Fri, Dec 2, 2016 at 10:44 AM, Wes Turner > wrote: > >> >> >> On Friday, December 2, 2016, Ted Warren via Omaha > > wrote: >> >>> Hello, >>> >>> My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton >>> University. I am a synaptic physiologist who studies neuronal circuits >>> involved in epilepsy. I am interested in started to write code using >>> Python, but I need some help. I have been going through the following >>> text >>> to learn how to use Python within the context of my field: A primer on >>> scientific programming with python, 3rd Ed. by Hans Petter Langtangen. >> >> >> There are lots of great Python books. >> >> - https://westurner.org/tools/#python >> - https://westurner.org/tools/#scipy >> >> - Scipy Lectures >> https://scipy-lectures.github.io/ >> >> - Lectures on scientific computing with Python >> https://github.com/jrjohansson/scientific-python-lectures/ >> blob/master/READ >> >> >> - https://learnxinyminutes.com/docs/python/ >> - https://learnxinyminutes.com/docs/python3/ >> >> >>> I have been unable to download the IDLE on my computer with Windows 10. I >>> have seen on the web that there is a bug and have been unable to find >>> anyone on the web who has been able to circumvent the problem for my >>> computer. >>> >>> I was wondering if there is anyone here that could help me.
I cannot get >>> off the ground learning until I get the IDE up and running. >> >> >> Anaconda installs a python (which includes idle, IIRC; there are much >> better IDEs for debugging: >> https://westurner.org/wiki/awesome-python-testing#ides ) >> >> https://docs.continuum.io/anaconda/install >> >> >>> >>> Just FYI, python is popular in neurophysiology for analyzing and modeling >>> neural circuits ( e.g., these two neurons signal via a capacitative >>> coupling mechanism ). I am just trying to catch up with some of >>> colleagues. >> >> >> >> http://martinos.org/mne/stable/ >> >> http://irakorshunova.github.io/2014/11/27/seizures.html >> >> https://conda-forge.github.io >> >> ... Tensors (Theano, TensorFlow (phone, workstation, cluster)) >> >> https://github.com/rhiever/tpot >> >> >>> >>> If I need to go somewhere else to get an answer for my question, any >>> suggestions for directions would be appreciated. >> >> >> Are you looking to hire a Python dev? >> >> - https://westurner.org/resume/html/resume.html >> >> TDD: Test-Driven-Development >> >> - https://westurner.org/2016/10/17/teaching-test-driven-development-first >> - https://westurner.org/2016/10/18/criteria-for-success-and- >> test-driven-development >> - https://wrdrd.com/docs/consulting/software-development# >> test-driven-development >> >> >>> Thank you ahead of time, >>> >>> Ted >>> _______________________________________________ >>> Omaha Python Users Group mailing list >>> Omaha at python.org >>> https://mail.python.org/mailman/listinfo/omaha >>> http://www.OmahaPython.org >>> >> > From jeffh at dundeemt.com Fri Dec 2 14:29:15 2016 From: jeffh at dundeemt.com (Jeff Hinrichs - DM&T) Date: Fri, 2 Dec 2016 13:29:15 -0600 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: Ted, It sounds like you might not have Python installed. This would be required before attempting to run or edit Python code. If you haven't or you are unsure, see this link: https://www.londonappdeveloper.com/setting-up-your-windows-10-system-for-python-development-pydev-eclipse-python/ Just follow the steps until you get to the section "Install the Eclipse PyDev Plugin". (You do not need the Development plugin, but the first part is quite good. One minor note, the current version of Python is 2.7.12, use that and not 2.7.9) Then return to your downloaded code and try again. On Fri, Dec 2, 2016 at 9:22 AM, Ted Warren via Omaha wrote: > Hi Burch, > > So, I will detail two problems. > > First, the book had me create a folder and download a python file. The file > contains a simple physics equation for calculating how high a ball is if > thrown up straight into the air with respect to time. If I click on the > file to open it up, a brief command window opens up ( for about half a > second ) and then closes immediately. > > Second, if I try to start idle from the command prompt, it either does the > same thing as above, or I get an error message that reads, "Windows cannot > find 'idle'. Make sure you typed the name correctly, and then try again. > > Peace, > > Ted > > On Fri, Dec 2, 2016 at 9:15 AM, Burch Kealey wrote: > > > Ted > > > > > > Can you be more specific. You state "I have been unable to download IDLE > > on my computer." > > > > > > That is pretty wide open. 
> > > > > > With regards > > > > > > Burch > > ------------------------------ > > *From:* Omaha on > > behalf of Ted Warren via Omaha > > *Sent:* Friday, December 2, 2016 9:04:15 AM > > *To:* omaha at python.org > > *Cc:* Ted Warren > > *Subject:* [omaha] help getting started > > > > Hello, > > > > My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton > > University. I am a synaptic physiologist who studies neuronal circuits > > involved in epilepsy. I am interested in started to write code using > > Python, but I need some help. I have been going through the following > text > > to learn how to use Python within the context of my field: A primer on > > scientific programming with python, 3rd Ed. by Hans Petter Langtangen. > > > > I have been unable to download the IDLE on my computer with Windows 10. I > > have seen on the web that there is a bug and have been unable to find > > anyone on the web who has been able to circumvent the problem for my > > computer. > > > > I was wondering if there is anyone here that could help me. I cannot get > > off the ground learning until I get the IDE up and running. > > > > Just FYI, python is popular in neurophysiology for analyzing and modeling > > neural circuits ( e.g., these two neurons signal via a capacitative > > coupling mechanism ). I am just trying to catch up with some of > colleagues. > > > > If I need to go somewhere else to get an answer for my question, any > > suggestions for directions would be appreciated. > > > > Thank you ahead of time, > > > > Ted > > _______________________________________________ > > Omaha Python Users Group mailing list > > Omaha at python.org > > https://mail.python.org/mailman/listinfo/omaha > > http://www.OmahaPython.org > > > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > -- Best, Jeff Hinrichs 402.218.1473 From ted.warren at gmail.com Fri Dec 2 14:33:47 2016 From: ted.warren at gmail.com (Ted Warren) Date: Fri, 2 Dec 2016 13:33:47 -0600 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: I want to thank everyone for their help. I got on the phone with someone and we think we know a way to a solution. It turns out that I do not have all the administrative privileges on my computer, so I will take it to the IT department on campus here and get it looked at. If it works, I will send out one last e-mail saying so. Otherwise, I am going to go through the options that were given to me on this thread one by one. Thanks everyone! Peace, Ted On Fri, Dec 2, 2016 at 1:29 PM, Jeff Hinrichs - DM&T wrote: > Ted, > > It sounds like you might not have Python installed. This would be > required before attempting to run or edit Python code. > If you haven't or you are unsure, see this link: > https://www.londonappdeveloper.com/setting-up-your-windows-10- > system-for-python-development-pydev-eclipse-python/ > > Just follow the steps until you get to the section "Install the Eclipse > PyDev Plugin". > (You do not need the Development plugin, but the first part is quite > good. One minor note, the current version of Python is 2.7.12, use that > and not 2.7.9) > > Then return to your downloaded code and try again. > > On Fri, Dec 2, 2016 at 9:22 AM, Ted Warren via Omaha > wrote: > >> Hi Burch, >> >> So, I will detail two problems. >> >> First, the book had me create a folder and download a python file. 
The >> file >> contains a simple physics equation for calculating how high a ball is if >> thrown up straight into the air with respect to time. If I click on the >> file to open it up, a brief command window opens up ( for about half a >> second ) and then closes immediately. >> >> Second, if I try to start idle from the command prompt, it either does the >> same thing as above, or I get an error message that reads, "Windows cannot >> find 'idle'. Make sure you typed the name correctly, and then try again. >> >> Peace, >> >> Ted >> >> On Fri, Dec 2, 2016 at 9:15 AM, Burch Kealey wrote: >> >> > Ted >> > >> > >> > Can you be more specific. You state "I have been unable to download >> IDLE >> > on my computer." >> > >> > >> > That is pretty wide open. >> > >> > >> > With regards >> > >> > >> > Burch >> > ------------------------------ >> > *From:* Omaha on >> > behalf of Ted Warren via Omaha >> > *Sent:* Friday, December 2, 2016 9:04:15 AM >> > *To:* omaha at python.org >> > *Cc:* Ted Warren >> > *Subject:* [omaha] help getting started >> >> > >> > Hello, >> > >> > My name is Ted Warren, Ph.D. I am a post-doctoral researcher at >> Creighton >> > University. I am a synaptic physiologist who studies neuronal circuits >> > involved in epilepsy. I am interested in started to write code using >> > Python, but I need some help. I have been going through the following >> text >> > to learn how to use Python within the context of my field: A primer on >> > scientific programming with python, 3rd Ed. by Hans Petter Langtangen. >> > >> > I have been unable to download the IDLE on my computer with Windows 10. >> I >> > have seen on the web that there is a bug and have been unable to find >> > anyone on the web who has been able to circumvent the problem for my >> > computer. >> > >> > I was wondering if there is anyone here that could help me. I cannot get >> > off the ground learning until I get the IDE up and running. >> > >> > Just FYI, python is popular in neurophysiology for analyzing and >> modeling >> > neural circuits ( e.g., these two neurons signal via a capacitative >> > coupling mechanism ). I am just trying to catch up with some of >> colleagues. >> > >> > If I need to go somewhere else to get an answer for my question, any >> > suggestions for directions would be appreciated. >> > >> > Thank you ahead of time, >> > >> > Ted >> > _______________________________________________ >> > Omaha Python Users Group mailing list >> > Omaha at python.org >> > https://mail.python.org/mailman/listinfo/omaha >> > http://www.OmahaPython.org >> > >> _______________________________________________ >> Omaha Python Users Group mailing list >> Omaha at python.org >> https://mail.python.org/mailman/listinfo/omaha >> http://www.OmahaPython.org >> > > > > -- > Best, > > Jeff Hinrichs > 402.218.1473 <(402)%20218-1473> > > From bkealey at unomaha.edu Fri Dec 2 10:15:54 2016 From: bkealey at unomaha.edu (Burch Kealey) Date: Fri, 2 Dec 2016 15:15:54 +0000 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: Ted Can you be more specific. You state "I have been unable to download IDLE on my computer." That is pretty wide open. With regards Burch ________________________________ From: Omaha on behalf of Ted Warren via Omaha Sent: Friday, December 2, 2016 9:04:15 AM To: omaha at python.org Cc: Ted Warren Subject: [omaha] help getting started Hello, My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton University. 
I am a synaptic physiologist who studies neuronal circuits involved in epilepsy. I am interested in started to write code using Python, but I need some help. I have been going through the following text to learn how to use Python within the context of my field: A primer on scientific programming with python, 3rd Ed. by Hans Petter Langtangen. I have been unable to download the IDLE on my computer with Windows 10. I have seen on the web that there is a bug and have been unable to find anyone on the web who has been able to circumvent the problem for my computer. I was wondering if there is anyone here that could help me. I cannot get off the ground learning until I get the IDE up and running. Just FYI, python is popular in neurophysiology for analyzing and modeling neural circuits ( e.g., these two neurons signal via a capacitative coupling mechanism ). I am just trying to catch up with some of colleagues. If I need to go somewhere else to get an answer for my question, any suggestions for directions would be appreciated. Thank you ahead of time, Ted _______________________________________________ Omaha Python Users Group mailing list Omaha at python.org https://mail.python.org/mailman/listinfo/omaha http://www.OmahaPython.org From adam.shaver at gmail.com Fri Dec 2 23:01:02 2016 From: adam.shaver at gmail.com (Adam Shaver) Date: Fri, 2 Dec 2016 22:01:02 -0600 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: Ted, I use anaconda python (link; it is a free python sandbox for scientific programming) when coding python from Windows 10. I find the sandbox is a little bit safer than trying to install it into the Windows Program Files (and later worrying about version clash). Anaconda comes with Scipy and Numpy, which I'm sure your book will cover. As per IDE (and IDLE indirectly), my suggestion would be to use a combination of the Jupyter (IPython) notebook for in-line toy-problem composition with a command-line (or IDLE) to drive bigger things. If you were solving some PDEs of voltage potentials, then you might want to work it out in a notebook. When it's developed into an object oriented piece of code, then you could drop it into your simulation harness and drive it via the command-line or IDLE. At that point, you could (should) probably write unit tests. Best, Adam On Fri, Dec 2, 2016 at 9:04 AM, Ted Warren via Omaha wrote: > Hello, > > My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton > University. I am a synaptic physiologist who studies neuronal circuits > involved in epilepsy. I am interested in started to write code using > Python, but I need some help. I have been going through the following text > to learn how to use Python within the context of my field: A primer on > scientific programming with python, 3rd Ed. by Hans Petter Langtangen. > > I have been unable to download the IDLE on my computer with Windows 10. I > have seen on the web that there is a bug and have been unable to find > anyone on the web who has been able to circumvent the problem for my > computer. > > I was wondering if there is anyone here that could help me. I cannot get > off the ground learning until I get the IDE up and running. > > Just FYI, python is popular in neurophysiology for analyzing and modeling > neural circuits ( e.g., these two neurons signal via a capacitative > coupling mechanism ). I am just trying to catch up with some of colleagues.
> > If I need to go somewhere else to get an answer for my question, any > suggestions for directions would be appreciated. > > Thank you ahead of time, > > Ted > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > From wes.turner at gmail.com Sat Dec 3 12:23:08 2016 From: wes.turner at gmail.com (Wes Turner) Date: Sat, 3 Dec 2016 11:23:08 -0600 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: On Friday, December 2, 2016, Adam Shaver via Omaha wrote: > Ted, > > I use anaconda python (link ). It is a > free python sandbox for scientific programming) when coding python from > Windows 10. I find the sandbox is a little bit safer than trying to install > it into the Windows Program Files (and later worrying about version clash). > Anaconda comes with Scipy and Numpy, which I'm sure you book will cover. > > As per IDE (and IDLE indirectly), my suggestion would be to use a > combination of the Jupyter (IPython) notebook for in-line toy-problem > composition with a command-line (or IDLE) to drive bigger things. If you > were solving some PDEs of voltage potentials, then you might want to work > it out in a notebook. When it's developed into an object oriented piece of > code, then you could drop it into your simulation harness and drive it via > the command-line or IDLE. > IDLE -> IPython [-> spyder] http://ipython.readthedocs.io/en/stable/overview.html#enhanced-interactive-python-shell IPython Notebook (now Jupyter Notebook) is built on top of IPython. The Jupyter Notebook interface is cool, convenient, great for publishing; but it *is* an open shell with the permissions of the user account it's running as; running on localhost. You can configure an SSL cert, or run it within a Docker container ( e.g. https://github.com/jupyter/docker-stacks/ ). - https://jupyter-notebook.readthedocs.io/en/latest/public_server.html - https://jupyter-notebook.readthedocs.io/en/latest/security.html Spyder has an IPython console tab/pane. https://pythonhosted.org/spyder/ipythonconsole.html > > At that point, you could (should) probably write > unit tests. The scientific method is testing a hypothesis. The hypothesis is a test (often of significance). (Otherwise it's a null hypothesis (and that may be p-hacking). How is this not confirmation bias? IDK) It's possible to run tests in notebooks: - https://github.com/bollwyvl/nosebook/ - http://github.com/taavi/ipython_nose - https://pypi.python.org/pypi/pytest-ipynb Exploratory analysis, just utilizing an API: Jupyter Notebook, https://github.com/jupyter/nbdime Writing a program/API: git diff, Spyder (vim w/ python-mode, makegreen, and a separate IPython CLI shell) > Best, > Adam > > > On Fri, Dec 2, 2016 at 9:04 AM, Ted Warren via Omaha > > wrote: > > > Hello, > > > > My name is Ted Warren, Ph.D. I am a post-doctoral researcher at Creighton > > University. I am a synaptic physiologist who studies neuronal circuits > > involved in epilepsy. I am interested in started to write code using > > Python, but I need some help. I have been going through the following > text > > to learn how to use Python within the context of my field: A primer on > > scientific programming with python, 3rd Ed. by Hans Petter Langtangen. > > > > I have been unable to download the IDLE on my computer with Windows 10.
I > > have seen on the web that there is a bug and have been unable to find > > anyone on the web who has been able to circumvent the problem for my > > computer. > > > > I was wondering if there is anyone here that could help me. I cannot get > > off the ground learning until I get the IDE up and running. > > > > Just FYI, python is popular in neurophysiology for analyzing and modeling > > neural circuits ( e.g., these two neurons signal via a capacitative > > coupling mechanism ). I am just trying to catch up with some of > colleagues. > > > > If I need to go somewhere else to get an answer for my question, any > > suggestions for directions would be appreciated. > > > > Thank you ahead of time, > > > > Ted > > _______________________________________________ > > Omaha Python Users Group mailing list > > Omaha at python.org > > https://mail.python.org/mailman/listinfo/omaha > > http://www.OmahaPython.org > > > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > From wes.turner at gmail.com Sun Dec 4 13:31:02 2016 From: wes.turner at gmail.com (Wes Turner) Date: Sun, 4 Dec 2016 12:31:02 -0600 Subject: [omaha] help getting started In-Reply-To: References: Message-ID: This may be of use in neural simulation (and a bit OT for a Python mailing list, but field-relevant nonetheless): - "Brain Computation Is Organized via Power-of-Two-Based Permutation Logic" http://journal.frontiersin.org/article/10.3389/fnsys.2016.00095/full *The Theory of Connectivity and its Predictions: A permutation-Based Wiring Logic To Cover Every Possibility* To explore these questions, we have put forth the Theory of Connectivity that proposes a rather simple mathematical rule in organizing the microarchitecture of cell assemblies into the specific-to-general computational primitives that would readily enable knowledge and adaptive behaviors to emerge in the brain (Tsien, 2015a,b; Li et al., 2016). The theory specifies that within each computational building block, termed "functional connectivity motif" (FCM), the total number of principal projection-cell cliques with distinct inputs should follow the power-of-two-based permutation equation of N = 2^i - 1 (N is the number of distinct neural cliques that can cover all possible permutations and combinations of specific-to-general input patterns, whereas i is the number of distinct information inputs; Figure 1). As such, each FCM consists of principal projection neuron cliques receiving specific inputs, as well as other principal projection neuron cliques receiving progressively more convergent inputs that systematically cover every possible pattern using the power-of-two-based permutation logic (Figure 1A). IDK how this compares with e.g. "Brian" (which is written in Python) http://briansimulator.org/ ... There are different chips for neural simulation? - https://en.wikipedia.org/wiki/SyNAPSE - https://en.wikipedia.org/wiki/Memristor#Applications - https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks#Hierarchical_temporal_memory IIUC, identifying patterns across time is a relatively hard problem for most existing neural networks? On Sat, Dec 3, 2016 at 11:23 AM, Wes Turner wrote: > > > On Friday, December 2, 2016, Adam Shaver via Omaha > wrote: > >> Ted, >> >> I use anaconda python (link ). It is >> a >> free python sandbox for scientific programming) when coding python from >> Windows 10.
I find the sandbox is a little bit safer than trying to >> install >> it into the Windows Program Files (and later worrying about version >> clash). >> Anaconda comes with Scipy and Numpy, which I'm sure you book will cover. >> >> As per IDE (and IDLE indirectly), my suggestion would be to use a >> combination of the Jupyter (IPython) notebook for in-line toy-problem >> composition with a command-line (or IDLE) to drive bigger things. If you >> were solving some PDEs of voltage potentials, then you might want to work >> it out in a notebook. When it's developed into an object oriented piece of >> code, then you could drop it into your simulation harness and drive it via >> the command-line or IDLE. > > >> > IDLE -> IPython [-> spyder] > > http://ipython.readthedocs.io/en/stable/overview.html# > enhanced-interactive-python-shell > > IPython Notebook (now Jupyter Notebook) is built on top of IPython. The > Jupyter Notebook interface is cool, convenient, great for publishing; but > it *is* a an open shell with the permissions of the user account it's > running as; running on localhost. You can configure an SSL cert, or run it > within a Docker container ( e.g. https://github.com/jupyter/docker-stacks/ > ). > > - https://jupyter-notebook.readthedocs.io/en/latest/public_server.html > - https://jupyter-notebook.readthedocs.io/en/latest/security.html > > Spyder has an IPython console tab/pane. > > https://pythonhosted.org/spyder/ipythonconsole.html > > > >> >> At that point, you could (should) probably write >> unit tests. > > > The scientific method is testing a hypothesis. The hypothesis is a test > (often of significance). (Otherwise it's a null hypothesis (and that may be > p-hacking). How is this not confirmation bias? IDK) > > It's possible to run tests in notebooks: > > - https://github.com/bollwyvl/nosebook/ > - http://github.com/taavi/ipython_nose > - https://pypi.python.org/pypi/pytest-ipynb > > Exploratory analysis, just utilizing an API: Jupyter Notebook, > https://github.com/jupyter/nbdime > > Writing a program/API: git diff, Spyder (vim w/ python-mode, makegreen, > and a separate IPython CLI shell) > > >> Best, >> Adam >> >> >> On Fri, Dec 2, 2016 at 9:04 AM, Ted Warren via Omaha >> wrote: >> >> > Hello, >> > >> > My name is Ted Warren, Ph.D. I am a post-doctoral researcher at >> Creighton >> > University. I am a synaptic physiologist who studies neuronal circuits >> > involved in epilepsy. I am interested in started to write code using >> > Python, but I need some help. I have been going through the following >> text >> > to learn how to use Python within the context of my field: A primer on >> > scientific programming with python, 3rd Ed. by Hans Petter Langtangen. >> > >> > I have been unable to download the IDLE on my computer with Windows 10. >> I >> > have seen on the web that there is a bug and have been unable to find >> > anyone on the web who has been able to circumvent the problem for my >> > computer. >> > >> > I was wondering if there is anyone here that could help me. I cannot get >> > off the ground learning until I get the IDE up and running. >> > >> > Just FYI, python is popular in neurophysiology for analyzing and >> modeling >> > neural circuits ( e.g., these two neurons signal via a capacitative >> > coupling mechanism ). I am just trying to catch up with some of >> colleagues. >> > >> > If I need to go somewhere else to get an answer for my question, any >> > suggestions for directions would be appreciated. 
>> > >> > Thank you ahead of time, >> > >> > Ted >> > _______________________________________________ >> > Omaha Python Users Group mailing list >> > Omaha at python.org >> > https://mail.python.org/mailman/listinfo/omaha >> > http://www.OmahaPython.org >> > >> _______________________________________________ >> Omaha Python Users Group mailing list >> Omaha at python.org >> https://mail.python.org/mailman/listinfo/omaha >> http://www.OmahaPython.org >> > From wereapwhatwesow at gmail.com Fri Dec 9 12:16:39 2016 From: wereapwhatwesow at gmail.com (Steve Young) Date: Fri, 9 Dec 2016 11:16:39 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: > > Sign up for Kaggle - Check. > Install Anaconda - Check https://docs.continuum.io/anaconda/install > Basic familiarity - Check. http://conda.pydata.org/docs/test-drive.html#managing-conda > Anaconda cheat sheet - Check. > http://conda.pydata.org/docs/using/cheatsheet.html > Pycharm and Anaconda - Check. https://www.jetbrains.com/help/pycharm/2016.1/conda-support-creating-conda-environment.html > > Steve > > On Thu, Dec 1, 2016 at 8:32 AM, Bob Haffner via Omaha > wrote: > >> Hi All, >> >> We're all set for the 12/14 group Kaggle competition kickoff! >> >> All experience levels are welcome. Bring your laptop if you'd like, but >> no >> biggie if you don't >> >> I didn't hear any objections to the Housing Prices competition so let's go >> with that one >> https://www.kaggle.com/c/house-prices-advanced-regression-techniques >> >> Suggested things to do prior to 12/14 >> -- Sign up on Kaggle >> -- Get your machine set up with some pydata libraries >> (Pandas, Numpy, SciKit-Learn and Jupyter Notebooks). I recommend the >> Anaconda distribution if you're just starting out >> -- Get some basic familiarity with the competition problem and data >> >> Let me know if you have any questions. >> >> Thanks! >> Bob >> >> >> On Tue, Oct 18, 2016 at 8:32 PM, Bob Haffner >> wrote: >> >> > Good deal. That's 3 of us (Naomi, you and me) by my count. Hopefully >> > others will join in!! >> > >> > I would be game for a December meetup. >> > >> > Sent from my iPhone >> > >> > > On Oct 18, 2016, at 8:13 PM, Steve Young via Omaha >> > wrote: >> > > >> > > I would enjoy participating, and learning what you data guys and gals >> do. >> > > (I am not a math guy) >> > > >> > > If Hubert does not take December, maybe we could have a sprint that >> > night? >> > > >> > > Steve >> > > >> > > On Mon, Oct 17, 2016 at 3:05 PM, Wes Turner via Omaha < >> omaha at python.org> >> > > wrote: >> > > >> > >> On Monday, October 17, 2016, Bob Haffner via Omaha > > >> > >> wrote: >> > >> >> > >>> Hi All, >> > >>> >> > >>> A few months ago someone brought up the idea of doing a Kaggle data >> > >> science >> > >>> competition as a group. Is there still interest in this? >> > >>> >> > >>> Some thoughts. >> > >>> Not sure of the details, but Kaggle allows individuals to form >> groups. >> > >> We >> > >>> could collaborate thru email (or perhaps something like Slack) and >> > maybe >> > >>> meet occasionally. When it's all said and done, we could present >> at a >> > >>> monthly meeting.
>> > >> >> > >> >> > >> A GitHub (repo, issues, and sphinx docs/ and/or GH wiki) could also >> be >> > >> useful: >> > >> >> > >> - gh-pages branch built from docs/ and nb/ >> > >> - .ipynb in notebooks/ or nb/ >> > >> - https://github.com/audreyr/cookiecutter-pypackage/ has packaging >> and >> > >> ReadTheDocs config >> > >> - >> > >> https://github.com/jupyter/docker-stacks/blob/master/ >> > >> scipy-notebook/Dockerfile >> > >> includes conda >> > >> >> > >> >> > >> >> > >>> >> > >>> This one looks good. Doesn't end till March 1st which gives us some >> > time >> > >>> and it doesn't look overly complicated. No prize money, though :-) >> > >>> https://www.kaggle.com/c/house-prices-advanced-regression- >> techniques >> > >> >> > >> >> > >> - http://rhiever.github.io/tpot/examples/Boston_Example/ >> > >> >> > >> - TPOT can utilize XGBoost (as mentioned in the Kaggle competition >> > >> description) >> > >> >> > >> >> > >> >> > >> - https://github.com/donnemartin/data-science-ipython-notebooks/ >> > >> >> > >> >> > >>> Forming groups >> > >>> https://www.kaggle.com/wiki/FormingATeam >> > >>> >> > >>> Would love to get some feedback on any of this >> > >>> >> > >>> Thanks, >> > >>> Bob >> > >>> _______________________________________________ >> > >>> Omaha Python Users Group mailing list >> > >>> Omaha at python.org >> > >>> https://mail.python.org/mailman/listinfo/omaha >> > >>> http://www.OmahaPython.org >> > >>> >> > >> _______________________________________________ >> > >> Omaha Python Users Group mailing list >> > >> Omaha at python.org >> > >> https://mail.python.org/mailman/listinfo/omaha >> > >> http://www.OmahaPython.org >> > >> >> > > _______________________________________________ >> > > Omaha Python Users Group mailing list >> > > Omaha at python.org >> > > https://mail.python.org/mailman/listinfo/omaha >> > > http://www.OmahaPython.org >> > >> _______________________________________________ >> Omaha Python Users Group mailing list >> Omaha at python.org >> https://mail.python.org/mailman/listinfo/omaha >> http://www.OmahaPython.org >> > > From wes.turner at gmail.com Fri Dec 9 13:31:35 2016 From: wes.turner at gmail.com (Wes Turner) Date: Fri, 9 Dec 2016 12:31:35 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: So, we need to mutate and crossover until Mean Squared Error (MSE) is optimally minimized? http://rhiever.github.io/tpot/examples/Boston_Example/ Looks like we need something like load_boston() in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/base.py https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data On Friday, December 9, 2016, Steve Young via Omaha wrote: > > > > Sign up for Kaggle - Check. > > Install Anaconda - Check https://docs.continuum.io/anaconda/install > > Basic familiarity - Check. http://conda.pydata.org > > /docs/test-drive.html#managing-conda > > Anaconda cheat sheet - Check. > > http://conda.pydata.org/docs/using/cheatsheet.html > > Pycharm and Anaconda - Check. https://www.jetbrains. > > com/help/pycharm/2016.1/conda-support-creating-conda-environment.html > > > > Steve > > > > On Thu, Dec 1, 2016 at 8:32 AM, Bob Haffner via Omaha > > > wrote: > > > >> Hi All, > >> > >> We're all set for the 12/14 group Kaggle competition kickoff! > >> > >> All experience levels are welcome. 
From choman at gmail.com Sat Dec 10 20:52:20 2016
From: choman at gmail.com (Chad Homan)
Date: Sat, 10 Dec 2016 19:52:20 -0600
Subject: [omaha] Python 3.6
Message-ID:

https://fossbytes.com/python-3-6-released-new-features/

Together We Win!
Looking for cloud storage, try pCloud (10 GB free)
--
Chad

Some people, when confronted with a problem, think "I know, I'll use
Windows."  Now they have two problems.

Some people claim if you play a Windows Install Disc backwards you'll
hear satanic messages.  That's nothing; if you play it forward, it
installs Windows.

From wes.turner at gmail.com Sat Dec 10 21:51:09 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Sat, 10 Dec 2016 20:51:09 -0600
Subject: [omaha] Python 3.6
In-Reply-To:
References:
Message-ID:

https://docs.python.org/3.6/whatsnew/3.6.html

https://github.com/python/cpython/compare/3.5...3.6

On Saturday, December 10, 2016, Chad Homan via Omaha <omaha at python.org> wrote:
> [...]

_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
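Two of the 3.6 headliners in one tiny illustrative example: formatted string
literals (PEP 498) and underscores in numeric literals (PEP 515). Requires
3.6 or newer:

    # Python 3.6+ only: f-strings (PEP 498) and numeric underscores (PEP 515).
    name = "Omaha Python"
    version = 3.6
    print(f"{name} welcomes Python {version}!")  # expressions interpolate inline

    population = 100_000       # underscores group digits for readability
    print(f"{population:,}")   # format specs work in f-strings -> "100,000"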
From bob.haffner at gmail.com Tue Dec 13 10:49:06 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Tue, 13 Dec 2016 09:49:06 -0600
Subject: [omaha] Group Data Science Competition (12/14) Reminder
Message-ID:

Tomorrow (12/14) at 6:30pm, Room 1 at Do Space.

Do Space
7205 Dodge St, Omaha, NE 68114

From wes.turner at gmail.com Wed Dec 14 13:19:29 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Wed, 14 Dec 2016 12:19:29 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

- https://github.com/westurner/house_prices (BSD 3-Clause, TPOT is GPLv3)
- https://github.com/westurner/house_prices/commits/develop
- https://github.com/westurner/house_prices/blob/develop/environment.yml
- https://github.com/westurner/house_prices/blob/develop/tests/test_house_prices.py
- https://github.com/westurner/house_prices/blob/develop/house_prices/analysis.py
- https://github.com/westurner/house_prices/blob/develop/house_prices/data.py

cookiecutter, hubflow

data.py loads the data into a sklearn Bunch after pd.get_dummies (and
datacleaner.autoclean, while I figure out how to use OneHotEncoder).

- [ ] update the docstrings from load_boston()
- https://github.com/rhiever/datacleaner/issues/1#issuecomment-266980937
  "I think it illogical to e.g. average Exterior1st in the Kaggle House
  Prices Dataset: the average of ImStucc and Wd Sdng seems nonsensical?"

analysis.py HEAD-1 (before I wrapped it in a class while waiting) looks
like it'll take another 4-5 hours on this notebook from 2009.

- PCA? There's probably a better way.

I added an environment.yml but haven't yet determined the minimal set for
setup.py, so:

    conda env update -f=environment.yml  # make condaenvupdate

I'm supposed to be at work now. I may not be able to make it this evening;
if not, good luck.

I'll add the generated pipeline to the github repo and share the URL.

On Fri, Dec 9, 2016 at 12:31 PM, Wes Turner <wes.turner at gmail.com> wrote:
> [...]

_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
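Not the actual data.py, but a minimal sketch of that shape: a
load_boston()-style loader over the competition's train.csv. The fillna
strategy and the 'Id' drop here are illustrative assumptions:

    # Hypothetical load_house_prices(), shaped like sklearn's load_boston().
    import pandas as pd
    from sklearn.datasets.base import Bunch  # moved to sklearn.utils.Bunch later

    def load_house_prices(path='train.csv'):
        df = pd.read_csv(path)
        target = df.pop('SalePrice')
        df = df.drop('Id', axis=1)
        df = pd.get_dummies(df)        # one-hot encode the categorical columns
        df = df.fillna(df.median())    # placeholder NA strategy
        return Bunch(data=df.values,
                     target=target.values,
                     feature_names=list(df.columns),
                     DESCR='Kaggle House Prices: Advanced Regression Techniques')

    housing = load_house_prices()
    print(housing.data.shape, housing.target.shape)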
From bob.haffner at gmail.com Wed Dec 14 23:16:05 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Wed, 14 Dec 2016 22:16:05 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Good turnout tonight!  We managed to form a team, explore some data, pick
some features, fit a model and make a submission.

We are currently sitting just outside the top 10 in 2,350th place :-)
We'll be climbing the ranks in no time though!

Still room for more folks if anyone else is interested.  FYI, we're going
to try and meet in January.

On Wed, Dec 14, 2016 at 12:19 PM, Wes Turner via Omaha <omaha at python.org>
wrote:
> [...]

_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
From bob.haffner at gmail.com Fri Dec 16 23:37:50 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Fri, 16 Dec 2016 22:37:50 -0600
Subject: [omaha] Kaggle Competition Notebook
Message-ID:

I cleaned up Wednesday night's jupyter notebook and added some comments.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: kaggle_house_prices.ipynb
Type: application/octet-stream
Size: 26024 bytes
Desc: not available
URL:

From bob.haffner at gmail.com Sat Dec 17 10:24:35 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Sat, 17 Dec 2016 09:24:35 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

We jumped up 58 spots thanks to Jeremy!  We are allotted 5 submissions per
day, so feel free to give it a go.

On Wed, Dec 14, 2016 at 10:16 PM, Bob Haffner <bob.haffner at gmail.com>
wrote:
> [...]

_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
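For anyone who wants to use one of those five daily slots: the smallest
valid submission is just an Id column plus a SalePrice column. A
constant-median baseline, purely to show the format (assumes train.csv and
test.csv from the competition's Data page are local):

    # Baseline submission: predict the training-set median for every house.
    import pandas as pd

    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')

    submission = pd.DataFrame({
        'Id': test['Id'],
        'SalePrice': train['SalePrice'].median(),  # one constant prediction
    })
    submission.to_csv('submission.csv', index=False)  # upload on the Submissions page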
From luke.schollmeyer at gmail.com Sat Dec 17 14:19:39 2016
From: luke.schollmeyer at gmail.com (Luke Schollmeyer)
Date: Sat, 17 Dec 2016 13:19:39 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Made a quick attempt between middle school basketball games...did nothing
more than pull in all non-categorical variables (rather than the five we
used) and fill NAs with the variable mean.  That improved the score to
0.25857 - better than Wednesday's, but not better than Jeremy's (good job,
btw).  Not really that much there to add to the notebook.

Next attempt I'll still stick with LR, but will work on the categoricals
and look at better ways to fill missing values (using the mean is a hack).

Luke

On Sat, Dec 17, 2016 at 9:24 AM, Bob Haffner via Omaha <omaha at python.org>
wrote:
> [...]

_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
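Roughly what that attempt looks like in code - a sketch under the same
assumptions (all numeric columns, mean-filled NAs, plain LinearRegression;
not Luke's actual notebook):

    # All numeric columns, mean-filled NAs, ordinary linear regression.
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')

    num_cols = [c for c in test.columns
                if test[c].dtype != 'object' and c != 'Id']
    X = train[num_cols].fillna(train[num_cols].mean())  # the "hack" fill
    y = train['SalePrice']

    model = LinearRegression().fit(X, y)
    preds = model.predict(test[num_cols].fillna(train[num_cols].mean()))
    pd.DataFrame({'Id': test['Id'], 'SalePrice': preds}
                 ).to_csv('submission_lr.csv', index=False)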
From wes.turner at gmail.com Sat Dec 17 14:56:48 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Sat, 17 Dec 2016 13:56:48 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

On Saturday, December 17, 2016, Luke Schollmeyer via Omaha
<omaha at python.org> wrote:

> Made a quick attempt between middle school basketball games...did nothing
> more than pull in all non-categorical variables (rather than the five we
> used) and fill NAs with the variable mean.

Does Kaggle take the high mark but still give a score for each submission?

Thinking of ways to keep track of which code produced which score; I'll
post about the GitHub setup in a bit.

> Next attempt I'll still stick with LR, but will work on the categoricals
> and look at better ways to fill missing values (using the mean is a hack).
https://github.com/rhiever/datacleaner/blob/master/README.md

> Replaces missing values with the mode (for categorical variables) or
> median (for continuous variables) on a column-by-column basis

> Luke
> [...]

_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
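That fill rule is straightforward to approximate by hand if you'd rather
not pull in the package - a sketch of just the imputation step (mode for
object columns, median for the rest; assumes no column is entirely NA):

    # datacleaner-style fill: mode for categorical (object) columns,
    # median for numeric ones, column by column.
    import pandas as pd

    def fill_missing(df):
        df = df.copy()
        for col in df.columns:
            if df[col].dtype == 'object':
                df[col] = df[col].fillna(df[col].mode()[0])  # most frequent value
            else:
                df[col] = df[col].fillna(df[col].median())
        return df

    train = fill_missing(pd.read_csv('train.csv'))
    print(train.isnull().sum().sum())  # 0 NAs remaining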
From wes.turner at gmail.com Sat Dec 17 15:24:58 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Sat, 17 Dec 2016 14:24:58 -0600
Subject: [omaha] github.com/omahapython and @omahapython/datascience REQ: for usernames
Message-ID:

I've created:

- an omahapython github organization account:
  https://github.com/omahapython

- an @omahapython/datascience team:
  https://github.com/orgs/omahapython/teams/datascience

- an omahapython/datascience repository:
  https://github.com/omahapython/datascience

- omahapython/datascience#3: "Kaggle Best Practices"
  https://github.com/omahapython/datascience/issues/3

- an omahapython/kaggle-houseprices repository:
  https://github.com/omahapython/kaggle-houseprices

- omahapython/kaggle-houseprices#1: "kaggle-houseprices #1"
  https://github.com/omahapython/kaggle-houseprices/issues/1

REQ (Request): Please reply with your github username if you want to be
added to the omahapython org and/or the omahapython/datascience team.
All I need is either:

  username, omahapython
  username, omahapython, @omahapython/datascience

> "Use @omahapython/datascience to mention this team in comments."

The @omahapython/datascience team has write access to kaggle-houseprices
(where I'll soon create the recommended kaggle competition folder
structure compiled in omahapython/datascience#3).

From bob.haffner at gmail.com Sat Dec 17 15:39:38 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Sat, 17 Dec 2016 14:39:38 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

> Does Kaggle take the high mark but still give a score for each submission?

Yes.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/submissions

> Thinking of ways to keep track of which code produced which score; I'll
> post about the GitHub setup in a bit.

We could push our notebooks to the github repo?  Maybe include a brief
description at the top in a markdown cell.

I initially thought github was a good way to go, but I don't know if
everyone has a github acct or is interested in starting one.  Maybe email
is the way to go?

On Sat, Dec 17, 2016 at 1:56 PM, Wes Turner via Omaha <omaha at python.org>
wrote:
> [...]

_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
> > > > > On Fri, Dec 9, 2016 at 12:31 PM, Wes Turner <wes.turner at gmail.com>
> > > > > wrote:
> > > > >
> > > > > > So, we need to mutate and crossover until Mean Squared Error (MSE)
> > > > > > is optimally minimized?
> > > > > >
> > > > > > http://rhiever.github.io/tpot/examples/Boston_Example/
> > > > > >
> > > > > > Looks like we need something like load_boston() in
> > > > > > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/base.py
> > > > > >
> > > > > > https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
> > > > > >
> > > > > > On Friday, December 9, 2016, Steve Young via Omaha <omaha at python.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Sign up for Kaggle - Check.
> > > > > > > Install Anaconda - Check.
> > > > > > >   https://docs.continuum.io/anaconda/install
> > > > > > > Basic familiarity - Check.
> > > > > > >   http://conda.pydata.org/docs/test-drive.html#managing-conda
> > > > > > > Anaconda cheat sheet - Check.
> > > > > > >   http://conda.pydata.org/docs/using/cheatsheet.html
> > > > > > > Pycharm and Anaconda - Check.
> > > > > > >   https://www.jetbrains.com/help/pycharm/2016.1/conda-support-creating-conda-environment.html
> > > > > > >
> > > > > > > Steve
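The load_boston()-style loader plus a minimal TPOT run that Wes describes
above boils down to something like this (a sketch, untested; assumes
../input/train.csv from the competition's data page, and Bunch imported
from the sklearn.datasets.base module he links):

    import pandas as pd
    from sklearn.datasets.base import Bunch
    from tpot import TPOTRegressor

    def load_house_prices(csv_path='../input/train.csv'):
        """Load the Kaggle train.csv as a sklearn-style Bunch.

        (Hypothetical helper; name and NA handling are placeholders.)
        """
        df = pd.read_csv(csv_path)
        y = df.pop('SalePrice').values
        df = df.drop('Id', axis=1)
        X = pd.get_dummies(df).fillna(0)  # crude NA handling for now
        return Bunch(data=X.values, target=y,
                     feature_names=list(X.columns))

    data = load_house_prices()
    tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
    tpot.fit(data.data, data.target)
    tpot.export('tpot_house_prices_pipeline.py')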
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From wes.turner at gmail.com  Sat Dec 17 16:25:28 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Sat, 17 Dec 2016 15:25:28 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

On Sat, Dec 17, 2016 at 2:39 PM, Bob Haffner via Omaha <omaha at python.org>
wrote:
> >Does Kaggle take the high mark but still give a score for each submission?
> Yes.
> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/submissions
>
> We could push our notebooks to the github repo? Maybe include a brief
> description at the top in a markdown cell

In my research [1], I found that the preferred folder structure for kaggle
is input/ (data), src/ (.py, .ipynb notebooks), and working/ (outputs);
and that they recommend creating a settings.json with path configuration
(e.g. pointing to input/, src/, data/).

So, we could put notebooks, folders, and repos in src/ [2].

runipy is a bit more scriptable than requiring notebook gui interactions [3].

We could either hardcode '../input/test.csv' in our .py and .ipynb sources,
or we could write a function in src/data.py to read '../settings.json' into
a dict with the recommended variable names [1]:

    import pandas as pd
    from data import read_settings_json

    settings = read_settings_json()
    train = pd.read_csv(settings['TRAIN_DATA_PATH'])
    # .... fit, predict, build a `submission` DataFrame ....
    submission.to_csv(settings['SUBMISSION_PATH'], index=False)

[1] https://github.com/omahapython/datascience/issues/3#issuecomment-267236556
[2] https://github.com/omahapython/kaggle-houseprices/tree/master/src
[3] https://pypi.python.org/pypi/runipy

> I initially thought github was a good way to go, but I don't know if
> everyone has a github acct or is interested in starting one.  Maybe email
> is the way to go?

I'm all for GitHub:

- git source control and revision numbers
- we're not able to easily share code in the mailing list
- we can learn from each others' solutions
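read_settings_json() doesn't exist yet; a minimal sketch of what it could
look like in src/data.py (assumes settings.json sits one directory up and
that every value in it is a relative path, using the key names from [1]):

    import json
    import os.path

    def read_settings_json(path='../settings.json'):
        """Read the kaggle path configuration into a dict of absolute paths.

        Expects keys like TRAIN_DATA_PATH, TEST_DATA_PATH, SUBMISSION_PATH
        (names per [1]; adjust to taste).
        """
        with open(path) as f:
            settings = json.load(f)
        # resolve each entry relative to where settings.json lives
        base = os.path.dirname(os.path.abspath(path))
        return {key: os.path.join(base, value)
                for key, value in settings.items()}

And with runipy [3], a notebook can then be run end-to-end from the shell
(runipy src/analysis.ipynb working/analysis_out.ipynb), so each scored
submission can be tied to a git revision.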
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From wes.turner at gmail.com  Sat Dec 17 16:28:51 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Sat, 17 Dec 2016 15:28:51 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

On Sat, Dec 17, 2016 at 3:25 PM, Wes Turner <wes.turner at gmail.com> wrote:

> I'm all for GitHub:
>
> - git source control and revision numbers
> - we're not able to easily share code in the mailing list
> - we can learn from each others' solutions

An example of mailing list limitations:

Your mail to 'Omaha' with the subject

    Re: [omaha] Group Data Science Competition

Is being held until the list moderator can review it for approval.
The reason it is being held:

    Message body is too big: 47004 bytes with a limit of 40 KB

(I trimmed out the reply chain; so this may make it through first)

_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From bob.haffner at gmail.com  Sat Dec 17 17:39:03 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Sat, 17 Dec 2016 16:39:03 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Hey all, regarding the January kaggle meetup that we talked about: maybe
we can meet following our regular monthly meeting (1/18). Would that be
easier/better for everyone?

On Sat, Dec 17, 2016 at 4:34 PM, Bob Haffner <bob.haffner at gmail.com> wrote:

> Just submitted another Linear Regression attempt (0.16136).  Added some
> features, both numeric and categorical, and created 3 numerics
>
> -TotalFullBaths
> -TotalHalfBaths
> -Pool
>
> Notebook attached
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From bob.haffner at gmail.com  Sat Dec 17 17:34:04 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Sat, 17 Dec 2016 16:34:04 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Just submitted another Linear Regression attempt (0.16136).  Added some
features, both numeric and categorical, and created 3 numerics

-TotalFullBaths
-TotalHalfBaths
-Pool

Notebook attached

-------------- next part --------------
A non-text attachment was scrubbed...
Name: kaggle_house_prices.ipynb
Type: application/octet-stream
Size: 29833 bytes
Desc: not available
URL:

From wes.turner at gmail.com  Sun Dec 18 04:45:51 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Sun, 18 Dec 2016 03:45:51 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Sounds great. 1/18.

I just submitted my first submission.csv to Kaggle! [1]

    $ python ./tpot_house_prices__001__modified.py
    class_sum: 264144946
    abs error: 5582809.288
    % error:   2.11354007432 %
    error**2:  252508654837.0

... Which moves us up to #1370!

    Your Best Entry
    You improved on your best score by 0.02469.
    You just moved up 608 positions on the leaderboard.

I have a few more things to try:

- Manually drop the 'Id' column
- do_get_dummies=True (data.py) + an EC2 m4.4xlarge instance
  - I got an oom error w/ an 8GB notebook (at 25/120 w/ verbosity=2)
  - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L94
- sklearn GridSearchCV and/or sklearn-deap over the TPOT hyperparameters
  (see the sketch below)
  - http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
  - https://github.com/rsteca/sklearn-deap
- REF,BLD,DOC,TST:
  - factor constants out in favor of settings.json and data.py
    - https://github.com/omahapython/kaggle-houseprices/blob/master/src/data.py
  - implement train.py and predict.py, too
  - create a Dockerfile FROM kaggle/docker-python:latest
    - https://github.com/omahapython/datascience/issues/3 "Kaggle Best Practices"
  - docstrings, tests
    - https://github.com/omahapython/datascience/wiki/resources

[1] https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py
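The grid search itself is the easy part; a sketch of the GridSearchCV
pattern on a plain sklearn regressor (applying it to TPOT's own
hyperparameters would swap in TPOTRegressor as the estimator, assuming it
exposes get_params()/set_params(), which I haven't verified):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    train = pd.read_csv('../input/train.csv')
    X = pd.get_dummies(train.drop(['Id', 'SalePrice'], axis=1)).fillna(0)
    y = train['SalePrice']

    # exhaustive search over a small grid, 3-fold CV,
    # scored with (negated) mean squared error
    grid = GridSearchCV(RandomForestRegressor(random_state=0),
                        param_grid={'n_estimators': [100, 300],
                                    'max_depth': [None, 8, 16]},
                        scoring='neg_mean_squared_error',
                        cv=3)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)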
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From wes.turner at gmail.com  Sun Dec 18 05:11:08 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Sun, 18 Dec 2016 04:11:08 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

In addition to posting to the mailing list, I created a comment on the
"Kaggle Submissions" issue [1]:

- Score: 0.13667 (#1370)
  - https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3925119
  - https://mail.python.org/pipermail/omaha/2016-December/002206.html
  - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py

[1] https://github.com/omahapython/kaggle-houseprices/issues/2
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From adam at scha.al  Sat Dec 17 18:06:44 2016
From: adam at scha.al (Adam Schaal)
Date: Sat, 17 Dec 2016 17:06:44 -0600
Subject: [omaha] github.com/omahapython and @omahapython/datascience REQ: for usernames
In-Reply-To:
References:
Message-ID:

I'd be interested in joining. I just wrapped up school, so I should be
more available for meetings and such going forward.

Github username: clevernyyyy

Thanks!

Adam Schaal

On Sat, Dec 17, 2016 at 2:24 PM, Wes Turner via Omaha <omaha at python.org>
wrote:

> REQ (Request): Please reply with your github username if you want to be
> added to the omahapython org and/or the omahapython/datascience team.
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From bob.haffner at gmail.com  Sat Dec 17 17:56:59 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Sat, 17 Dec 2016 16:56:59 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

I believe my prior email was rejected because of the attachment size so
I'm sending a link to my notebook instead. I apologize if there are
duplicates.

https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices.ipynb

***************

Just submitted another Linear Regression attempt (0.16136).  Added some
features, both numeric and categorical, and created 3 numerics

-TotalFullBaths
-TotalHalfBaths
-Pool

Notebook
https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices.ipynb
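For anyone not opening the notebook: one plausible construction of those
three features (the exact definitions are in the notebook; the column
names below are from the competition's train.csv):

    import pandas as pd

    train = pd.read_csv('../input/train.csv')

    # combine above-grade and basement bath counts
    train['TotalFullBaths'] = train['FullBath'] + train['BsmtFullBath']
    train['TotalHalfBaths'] = train['HalfBath'] + train['BsmtHalfBath']

    # collapse PoolArea into a has-a-pool indicator
    train['Pool'] = (train['PoolArea'] > 0).astype(int)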
>> > >> > >> > Does Kaggle take the high mark but still give a score for each >> submission? >> > >> > Thinking of ways to keep track of which code produced which score; I'll >> > post about the GitHub setup in a bit. >> > >> > >> > > >> > > Next attempt I'll still stick with LR, but will work on the >> categorical's >> > > and look at better ways to fill missing values (using the mean is a >> > hack). >> > >> > >> > https://github.com/rhiever/datacleaner/blob/master/README.md >> > >> > > Replaces missing values with the mode (for categorical variables) or >> > median (for continuous variables) on a column-by-column basis >> > >> > >> > > >> > > Luke >> > > >> > > On Sat, Dec 17, 2016 at 9:24 AM, Bob Haffner via Omaha < >> omaha at python.org >> > > > >> > > wrote: >> > > >> > > > We jumped up 58 spots thanks to Jeremy! We are allotted 5 >> submissions >> > > per >> > > > day so feel free to give it a go >> > > > >> > > > On Wed, Dec 14, 2016 at 10:16 PM, Bob Haffner < >> bob.haffner at gmail.com >> > > > >> > > > wrote: >> > > > >> > > > > Good turnout tonight! We managed to form a team, explore some >> data, >> > > pick >> > > > > some features, fit a model and make a submission. >> > > > > >> > > > > We are currently sitting just outside the top 10 in 2,350th place >> :-) >> > > > > We'll be climbing the ranks in no time though! >> > > > > >> > > > > Still room for more folks if anyone else is interested. FYI, >> we're >> > > going >> > > > > to try and meet in January >> > > > > >> > > > > On Wed, Dec 14, 2016 at 12:19 PM, Wes Turner via Omaha < >> > > omaha at python.org >> > > > > >> > > > > wrote: >> > > > > >> > > > >> - https://github.com/westurner/house_prices (BSD 3-Clause, TPOT >> is >> > > > GPLv3) >> > > > >> - https://github.com/westurner/house_prices/commits/develop >> > > > >> - https://github.com/westurner/house_prices/blob/develop/ >> > > > environment.yml >> > > > >> - >> > > > >> https://github.com/westurner/house_prices/blob/develop/tests >> > > > >> /test_house_prices.py >> > > > >> - >> > > > >> https://github.com/westurner/house_prices/blob/develop/house >> > > > >> _prices/analysis.py >> > > > >> - >> > > > >> https://github.com/westurner/house_prices/blob/develop/house >> > > > >> _prices/data.py >> > > > >> >> > > > >> cookiecutter, hubflow >> > > > >> >> > > > >> data.py loads the data into a sklearn Bunch after >> pd.do_get_dummies >> > > (and >> > > > >> dataclean.autoclean, while I figure out how to use >> OneHotEncoder). >> > > > >> >> > > > >> - [ ] update the docstrings from load_boston() >> > > > >> - https://github.com/rhiever/datacleaner/issues/1# >> > > > issuecomment-266980937 >> > > > >> "I think it illogical to e.g. average Exterior1st in the >> Kaggle >> > > House >> > > > >> Prices Dataset: the average of ImStucc and Wd Sdng seems >> > nonsensical?" >> > > > >> >> > > > >> analysis.py HEAD-1 (before I wrapped it in a class while waiting) >> > > looks >> > > > >> like it'll be another 4-5 hours on this notebook from 2009. >> > > > >> >> > > > >> - PCA? There's probably a better way. >> > > > >> >> > > > >> I added an environment.yml but haven't yet determined the minimal >> > set >> > > > for >> > > > >> setup.py, so >> > > > >> >> > > > >> conda env update -f=environment.yml # make condaenvupdate >> > > > >> >> > > > >> >> > > > >> I'm supposed to be at work now. I may not be able to make it this >> > > > evening; >> > > > >> if not, good luck. >> > > > >> >> > > > >> I'll add the generated pipeline to the github repo and share the >> > URL. 
>> > > > >> >> > > > >> >> > > > >> On Fri, Dec 9, 2016 at 12:31 PM, Wes Turner < >> wes.turner at gmail.com >> > > > >> > > > wrote: >> > > > >> >> > > > >> > So, we need to mutate and crossover until Mean Squared Error >> (MSE) >> > > is >> > > > >> > optimally minimized? >> > > > >> > >> > > > >> > http://rhiever.github.io/tpot/examples/Boston_Example/ >> > > > >> > >> > > > >> > Looks like we need something like load_boston() in >> > > > >> > https://github.com/scikit-learn/scikit-learn/blob/ >> > > > >> > master/sklearn/datasets/base.py >> > > > >> > >> > > > >> > >> > > > >> > https://www.kaggle.com/c/house-prices-advanced-regression- >> > > > >> techniques/data >> > > > >> > >> > > > >> > >> > > > >> > On Friday, December 9, 2016, Steve Young via Omaha < >> > > omaha at python.org > >> > > > >> > wrote: >> > > > >> > >> > > > >> >> > >> > > > >> >> > Sign up for Kaggle - Check. >> > > > >> >> > Install Anaconda - Check https://docs.continuum.io/ >> > > > anaconda/install >> > > > >> >> > Basic familiarity - Check. http://conda.pydata.org >> > > > >> >> > /docs/test-drive.html#managing-conda >> > > > >> >> > Anaconda cheat sheet - Check. >> > > > >> >> > http://conda.pydata.org/docs/using/cheatsheet.html >> > > > >> >> > Pycharm and Anaconda - Check. https://www.jetbrains. >> > > > >> >> > com/help/pycharm/2016.1/conda- >> support-creating-conda-environ >> > > > >> ment.html >> > > > >> >> > >> > > > >> >> > Steve >> > > > >> >> > >> > > > >> >> > On Thu, Dec 1, 2016 at 8:32 AM, Bob Haffner via Omaha < >> > > > >> omaha at python.org >> > > > >> >> > >> > > > >> >> > wrote: >> > > > >> >> > >> > > > >> >> >> Hi All, >> > > > >> >> >> >> > > > >> >> >> We're all set for the 12/14 group Kaggle competition >> kickoff! >> > > > >> >> >> >> > > > >> >> >> All experience levels are welcome. Bring your laptop if >> you'd >> > > > like, >> > > > >> >> but >> > > > >> >> >> no >> > > > >> >> >> biggie if you don't >> > > > >> >> >> >> > > > >> >> >> I didn't hear any objections to the Housing Prices >> competition >> > > so >> > > > >> >> let's go >> > > > >> >> >> with that one >> > > > >> >> >> https://www.kaggle.com/c/house-prices-advanced-regression- >> > > > >> techniques >> > > > >> >> >> >> > > > >> >> >> Suggested things to do prior to 12/14 >> > > > >> >> >> -- Sign up on Kaggle >> > > > >> >> >> -- Get your machine set up with some pydata libraries >> > > > >> >> >> (Pandas, Numpy, SciKit-Learn and Jupyter Notebooks). I >> > > recommend >> > > > >> the >> > > > >> >> >> Anaconda distribution if you're just starting out >> > > > >> >> >> -- Get some basic familiarity with the competition problem >> and >> > > > data >> > > > >> >> >> >> > > > >> >> >> Let me know if you have any questions. >> > > > >> >> >> >> > > > >> >> >> Thanks! >> > > > >> >> >> Bob >> > > > >> >> >> >> > > > >> >> >> >> > > > >> >> >> On Tue, Oct 18, 2016 at 8:32 PM, Bob Haffner < >> > > > bob.haffner at gmail.com >> > > > >> > >> > > > >> >> >> wrote: >> > > > >> >> >> >> > > > >> >> >> > Good deal. That's 3 of us (Naomi, you and me) by my >> count. >> > > > >> Hopefully >> > > > >> >> >> > others will join in!! >> > > > >> >> >> > >> > > > >> >> >> > I would be game for a December meetup. 
From bob.haffner at gmail.com  Sun Dec 18 10:26:03 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Sun, 18 Dec 2016 09:26:03 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Nice job, Wes!!

On Sun, Dec 18, 2016 at 4:11 AM, Wes Turner <wes.turner at gmail.com> wrote:

> In addition to posting to the mailing list, I created a comment on the
> "Kaggle Submissions" issue [1]:
>
> > - Score: 0.13667 (#1370)
> >   - https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3925119
> >   - https://mail.python.org/pipermail/omaha/2016-December/002206.html
> >   - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py
>
> [1] https://github.com/omahapython/kaggle-houseprices/issues/2
>
> On Sun, Dec 18, 2016 at 3:45 AM, Wes Turner <wes.turner at gmail.com> wrote:
>
>> Sounds great. 1/18.
>>
>> I just submitted my first submission.csv to Kaggle! [1]
>>
>>     $ python ./tpot_house_prices__001__modified.py
>>     class_sum:  264144946
>>     abs error:  5582809.288
>>     % error:    2.11354007432 %
>>     error**2:   252508654837.0
>>
>> ... Which moves us up to #1370!
>>
>>     Your Best Entry
>>     You improved on your best score by 0.02469.
>>     You just moved up 608 positions on the leaderboard.
>>
>> I have a few more things to try:
>>
>> - Manually drop the 'Id' column
>> - do_get_dummies=True (data.py) + EC2 m4.4xlarge instance
>>   - I got an OOM error w/ an 8GB notebook (at 25/120 w/ verbosity=2)
>>   - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L94
>> - sklearn GridSearchCV and/or sklearn-deap over the TPOT hyperparameters
>>   - http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>>   - https://github.com/rsteca/sklearn-deap
>> - REF,BLD,DOC,TST:
>>   - factor constants out in favor of settings.json and data.py
>>     - https://github.com/omahapython/kaggle-houseprices/blob/master/src/data.py
>>   - implement train.py and predict.py, too
>>   - create a Dockerfile FROM kaggle/docker-python:latest
>>     - https://github.com/omahapython/datascience/issues/3 "Kaggle Best Practices"
>>   - docstrings, tests
>>   - https://github.com/omahapython/datascience/wiki/resources
>>
>> [1] https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py
>>
>> On Sat, Dec 17, 2016 at 4:39 PM, Bob Haffner via Omaha <omaha at python.org> wrote:
>>
>>> Hey all, regarding our January kaggle meetup that we talked about.  Maybe
>>> we can meet following our regular monthly (1/18).
>>>
>>> Would that be easier/better for everyone?
>>>
>>> On Sat, Dec 17, 2016 at 4:34 PM, Bob Haffner <bob.haffner at gmail.com> wrote:
>>>
>>>> Just submitted another Linear Regression attempt (0.16136).  Added some
>>>> features, both numeric and categorical, and created 3 numerics
>>>>
>>>> -TotalFullBaths
>>>> -TotalHalfBaths
>>>> -Pool
>>>>
>>>> Notebook attached
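As a reference point for the GridSearchCV item in the todo list above: a
generic hyperparameter search over a scikit-learn regressor.  Searching
TPOT's own hyperparameters (or plugging in sklearn-deap's evolutionary
search) would follow the same estimator/param_grid pattern; the grid values
here are illustrative:

    from sklearn.datasets import load_boston
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # Stand-in data; the competition frames would be used instead
    housing = load_boston()
    param_grid = {'n_estimators': [50, 100], 'max_depth': [None, 5, 10]}
    search = GridSearchCV(RandomForestRegressor(), param_grid,
                          scoring='neg_mean_squared_error', cv=5)
    search.fit(housing.data, housing.target)
    print(search.best_params_, search.best_score_)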
>>>> On Sat, Dec 17, 2016 at 4:21 PM, Bob Haffner <bob.haffner at gmail.com> wrote:
>>>>
>>>>> Just submitted another Linear Regression attempt (0.16136).  [...]
>>>>>
>>>>> On Sat, Dec 17, 2016 at 3:28 PM, Wes Turner <wes.turner at gmail.com> wrote:
>>>>>
>>>>>> On Sat, Dec 17, 2016 at 3:25 PM, Wes Turner <wes.turner at gmail.com> wrote:
>>>>>>
>>>>>>> On Sat, Dec 17, 2016 at 2:39 PM, Bob Haffner via Omaha <omaha at python.org> wrote:
>>>>>>>
>>>>>>>> > Does Kaggle take the high mark but still give a score for each
>>>>>>>> > submission?
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/submissions
>>>>>>>>
>>>>>>>> > Thinking of ways to keep track of which code produced which score;
>>>>>>>> > I'll post about the GitHub setup in a bit.
>>>>>>>>
>>>>>>>> We could push our notebooks to the github repo?  Maybe include a
>>>>>>>> brief description at the top in a markdown cell
>>>>>>>
>>>>>>> In my research [1], I found that the preferred folder structure for
>>>>>>> kaggle is input/ (data), src/ (.py, .ipynb notebooks), and working/
>>>>>>> (outputs); and that they recommend creating a settings.json with path
>>>>>>> configuration (e.g. pointing to input/, src/, data/)
>>>>>>>
>>>>>>> So, we could put notebooks, folders, and repos in src/ [2].
>>>>>>>
>>>>>>> runipy is a bit more scriptable than requiring notebook gui
>>>>>>> interactions [3].
>>>>>>>
>>>>>>> We could either hardcode '../input/test.csv' in our .py and .ipynb
>>>>>>> sources, or we could write a function in src/data.py to read
>>>>>>> '../settings.json' into a dict with the recommended variable names [1]:
>>>>>>>
>>>>>>>     from data import read_settings_json
>>>>>>>     settings = read_settings_json()
>>>>>>>     train = pd.read_csv(settings['TRAIN_DATA_PATH'])
>>>>>>>     # ....
>>>>>>>     submission.to_csv(settings['SUBMISSION_PATH'])  # pandas writes via DataFrame.to_csv
>>>>>>>
>>>>>>> [1] https://github.com/omahapython/datascience/issues/3#issuecomment-267236556
>>>>>>> [2] https://github.com/omahapython/kaggle-houseprices/tree/master/src
>>>>>>> [3] https://pypi.python.org/pypi/runipy
>>>>>>>
>>>>>>>> I initially thought github was a good way to go, but I don't know if
>>>>>>>> everyone has a github acct or is interested in starting one.  Maybe
>>>>>>>> email is the way to go?
>>>>>>>
>>>>>>> I'm all for GitHub:
>>>>>>>
>>>>>>> - git source control and revision numbers
>>>>>>> - we're not able to easily share code in the mailing list
>>>>>>> - we can learn from each others' solutions
>>>>>>
>>>>>> An example of mailing list limitations:
>>>>>>
>>>>>>     Your mail to 'Omaha' with the subject
>>>>>>
>>>>>>         Re: [omaha] Group Data Science Competition
>>>>>>
>>>>>>     Is being held until the list moderator can review it for approval.
>>>>>>
>>>>>>     The reason it is being held:
>>>>>>
>>>>>>         Message body is too big: 47004 bytes with a limit of 40 KB
>>>>>>
>>>>>> (I trimmed out the reply chain; so this may make it through first)
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
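A minimal read_settings_json along the lines of the snippet quoted above;
the '../settings.json' location and the key names follow the
Kaggle-recommended convention cited in [1], but are otherwise assumptions:

    import json

    def read_settings_json(path='../settings.json'):
        """Load path configuration (e.g. TRAIN_DATA_PATH, TEST_DATA_PATH,
        SUBMISSION_PATH) into a dict."""
        with open(path) as f:
            return json.load(f)

A matching settings.json would map those keys to paths under input/ and
working/.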
From wes.turner at gmail.com  Sun Dec 18 11:59:17 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Sun, 18 Dec 2016 10:59:17 -0600
Subject: [omaha] github.com/omahapython and @omahapython/datascience REQ: for usernames
In-Reply-To:
References:
Message-ID:

On Sat, Dec 17, 2016 at 5:06 PM, Adam Schaal via Omaha <omaha at python.org> wrote:

> I'd be interested in joining. I just wrapped school, so should be more
> available for meetings and such going forward.
>
> Github username: clevernyyyy
>
> Thanks!

I sent an invitation to the omahapython group.
Do you also want to be added to the omahapython/datascience group?

> Adam Schaal
>
> On Sat, Dec 17, 2016 at 2:24 PM, Wes Turner via Omaha <omaha at python.org> wrote:
>
> > I've created:
> >
> > - an omahapython github organization account:
> >   https://github.com/omahapython
> > - an @omahapython/datascience team:
> >   https://github.com/orgs/omahapython/teams/datascience
> > - an omahapython/datascience repository
> >   https://github.com/omahapython/datascience
> > - omahapython/datascience#3: "Kaggle Best Practices"
> >   https://github.com/omahapython/datascience/issues/3
> > - an omahapython/kaggle-houseprices repository
> >   https://github.com/omahapython/kaggle-houseprices
> > - omahapython/kaggle-houseprices#1: "kaggle-houseprices #1"
> >   https://github.com/omahapython/kaggle-houseprices/issues/1
> >
> > REQ (Request): Please reply with your github username if you want to be
> > added to the omahapython org and/or the omahapython/datascience team.
> > All I need is either:
> >
> >     username, omahapython
> >     username, omahapython, @omahapython/datascience
> >
> > > "Use @omahapython/datascience to mention this team in comments."
> >
> > The @omahapython/datascience team has write access to kaggle-houseprices
> > (where I'll soon create the recommended kaggle competition folder
> > structure compiled in omahapython/datascience#3).
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From wes.turner at gmail.com  Sun Dec 18 11:59:49 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Sun, 18 Dec 2016 10:59:49 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Thanks, Bob!

On Sun, Dec 18, 2016 at 9:26 AM, Bob Haffner <bob.haffner at gmail.com> wrote:

> Nice job, Wes!!
> [...]
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
From bob.haffner at gmail.com  Sun Dec 18 12:55:07 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Sun, 18 Dec 2016 11:55:07 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Wes, I can try to run your process with do_get_dummies=True.  Anything
else need to change?

Sent from my iPhone

> On Dec 18, 2016, at 10:59 AM, Wes Turner <wes.turner at gmail.com> wrote:
>
> Thanks, Bob!
> [...]
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
From wes.turner at gmail.com  Sun Dec 18 14:00:14 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Sun, 18 Dec 2016 13:00:14 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

On Sun, Dec 18, 2016 at 11:55 AM, Bob Haffner <bob.haffner at gmail.com> wrote:

> Wes, I can try to run your process with do_get_dummies=True.  Anything
> else need to change?

Yup,

https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L94 :

    if do_get_dummies:
        def get_categorical_columns(column_categories):
            for colkey in column_categories:
                values = column_categories[colkey]
                if len(values):
                    yield colkey
        categorical_columns = list(get_categorical_columns(column_categories))
        get_dummies_dict = {key: key for key in categorical_columns}
        df = pd.get_dummies(df, prefix=get_dummies_dict,
                            columns=get_dummies_dict)

Needs to also be applied to train_csv and test_csv in the generated and
modified pipeline:
https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py#L40

So, I can either copy/paste or factor it out:

- copy/paste: just wrong
- factor it out:
  - this creates a (new) dependency on house_prices from within the
    generated pipeline, which currently depends on [stable versions of]
    (datacleaner, pandas, and scikit-learn)

... TODO: today

- [ ] pd.get_dummies(train_df), pd.get_dummies(test_df)
- [ ] Dockerfile
  - probably the easiest way to reproduce the environment.yml
- [ ] automate the __modified.py patching process:

    # git clone ssh://git at github.com/westurner/house_prices  # -b develop
    conda env update -f ./environment.yml
    cd house_prices/
    python ./analysis.py
    # (wait)
    mv ./pipelines/tpot_house_prices_.py \
       ./pipelines/tpot_house_prices__002.py
    mv ./pipelines/tpot_house_prices_.py.json \
       ./pipelines/tpot_house_prices__002.py.json
    cp ./pipelines/tpot_house_prices__001__modified.py \
       ./pipelines/tpot_house_prices__002__modified.py
    # copy/paste (TODO: patch/template):
    #  - exported_pipeline / self.exported_pipeline
    #  - the sklearn imports, to __002__modified.py
    cd pipelines/  # TODO: settings.json
    python ./tpot_house_prices__002__modified.py

... The modified pipeline generation is not quite reproducible yet, but the
generated pipeline (tpot_house_prices__001[__modified].py) is.  (With ~2%
error ... only about ~$6mn dollars off :|)

> Sent from my iPhone
>
> On Dec 18, 2016, at 10:59 AM, Wes Turner <wes.turner at gmail.com> wrote:
> [...]
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
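One hedged way around the "needs to also be applied to train_csv and
test_csv" problem above, without copy/paste or a new dependency:
dummy-encode the concatenated frames so train and test come out with
identical columns.  This is a common workaround rather than the repo's
actual approach; the file paths and the SalePrice column name match the
competition data:

    import pandas as pd

    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')

    # Separate the target so only feature columns get encoded
    y = train_df.pop('SalePrice')

    # Encode train and test together so both share one dummy-column
    # vocabulary, then split back apart on the original row count
    n_train = len(train_df)
    combined = pd.concat([train_df, test_df], ignore_index=True)
    combined = pd.get_dummies(combined)
    train_enc = combined.iloc[:n_train]
    test_enc = combined.iloc[n_train:]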
From bob.haffner at gmail.com  Sun Dec 18 19:25:41 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Sun, 18 Dec 2016 18:25:41 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Looks like it's not as simple as setting a value to True so I'll let you
sort it out.

On Sun, Dec 18, 2016 at 1:00 PM, Wes Turner <wes.turner at gmail.com> wrote:

> On Sun, Dec 18, 2016 at 11:55 AM, Bob Haffner <bob.haffner at gmail.com> wrote:
>
>> Wes, I can try to run your process with do_get_dummies=True.  Anything
>> else need to change?
>
> Yup,
> [...]
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
From wereapwhatwesow at gmail.com  Sun Dec 18 19:46:35 2016
From: wereapwhatwesow at gmail.com (Steve Young)
Date: Sun, 18 Dec 2016 18:46:35 -0600
Subject: [omaha] github.com/omahapython and @omahapython/datascience REQ: for usernames
In-Reply-To:
References:
Message-ID:

Thanks for setting this up Wes.  Please add me to both:

username - sceva, omahapython, @omahapython/datascience

On Sun, Dec 18, 2016 at 10:59 AM, Wes Turner via Omaha <omaha at python.org> wrote:

> On Sat, Dec 17, 2016 at 5:06 PM, Adam Schaal via Omaha <omaha at python.org> wrote:
>
> > I'd be interested in joining. I just wrapped school, so should be more
> > available for meetings and such going forward.
> >
> > Github username: clevernyyyy
>
> I sent an invitation to the omahapython group.
> Do you also want to be added to the omahapython/datascience group?
> [...]
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
From rob.townley at gmail.com  Sun Dec 18 19:58:51 2016
From: rob.townley at gmail.com (Rob Townley)
Date: Sun, 18 Dec 2016 18:58:51 -0600
Subject: [omaha] github.com/omahapython and @omahapython/datascience REQ: for usernames
In-Reply-To:
References:
Message-ID:

username - rjt, omahapython, @omahapython/datascience

On Dec 18, 2016 6:51 PM, "Steve Young via Omaha" <omaha at python.org> wrote:

> Thanks for setting this up Wes.  Please add me to both:
>
> username - sceva, omahapython, @omahapython/datascience
> [...]
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
From wes.turner at gmail.com  Sun Dec 18 20:17:47 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Sun, 18 Dec 2016 19:17:47 -0600
Subject: [omaha] github.com/omahapython and @omahapython/datascience REQ: for usernames
In-Reply-To:
References:
Message-ID:

On Sun, Dec 18, 2016 at 6:46 PM, Steve Young via Omaha <omaha at python.org> wrote:

> Thanks for setting this up Wes.  Please add me to both:
>
> username - sceva, omahapython, @omahapython/datascience

yw! invitation sent.

> [...]
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
> > > > _______________________________________________ > > > > Omaha Python Users Group mailing list > > > > Omaha at python.org > > > > https://mail.python.org/mailman/listinfo/omaha > > > > http://www.OmahaPython.org > > > > > > > _______________________________________________ > > > Omaha Python Users Group mailing list > > > Omaha at python.org > > > https://mail.python.org/mailman/listinfo/omaha > > > http://www.OmahaPython.org > > > > > _______________________________________________ > > Omaha Python Users Group mailing list > > Omaha at python.org > > https://mail.python.org/mailman/listinfo/omaha > > http://www.OmahaPython.org > > > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > From wes.turner at gmail.com Sun Dec 18 20:18:01 2016 From: wes.turner at gmail.com (Wes Turner) Date: Sun, 18 Dec 2016 19:18:01 -0600 Subject: [omaha] github.com/omahapython and @omahapython/datascience REQ: for usernames In-Reply-To: References: Message-ID: On Sun, Dec 18, 2016 at 6:58 PM, Rob Townley via Omaha wrote: > username - rjt, omahapython, @omahapython/datascience > invitation sent. > > On Dec 18, 2016 6:51 PM, "Steve Young via Omaha" wrote: > > > Thanks for setting this up Wes. Please add me to both: > > > > username - sceva, omahapython, @omahapython/datascience > > > > On Sun, Dec 18, 2016 at 10:59 AM, Wes Turner via Omaha > > > wrote: > > > > > On Sat, Dec 17, 2016 at 5:06 PM, Adam Schaal via Omaha < > omaha at python.org > > > > > > wrote: > > > > > > > I'd be interested in joining. I just wrapped school, so should be > more > > > > available for meetings and such going forward. > > > > > > > > Github username: clevernyyyy > > > > > > > > Thanks! > > > > > > > > > > I sent an invitation to the omahapython group. > > > Do you also want to be added to the omahapython/datascience group? > > > > > > > > > > > > > > Adam Schaal > > > > > > > > On Sat, Dec 17, 2016 at 2:24 PM, Wes Turner via Omaha < > > omaha at python.org> > > > > wrote: > > > > > > > > > I've created: > > > > > > > > > > - an omahapython github organization account: > > > > > https://github.com/omahapython > > > > > > > > > > - an @omahapython/datascience team: > > > > > https://github.com/orgs/omahapython/teams/datascience > > > > > > > > > > - an omahapython/datascience repository > > > > > https://github.com/omahapython/datascience- > > > > > > > > > > - omahapython/datascience#3: "Kaggle Best Practices" > > > > > https://github.com/omahapython/datascience/issues/3 > > > > > > > > > > - an omahapython/kaggle-houseprices repository > > > > > https://github.com/omahapython/kaggle-houseprices > > > > > > > > > > - omahapython/kaggle-houseprices#1: "kaggle-houseprices #1" > > > > > https://github.com/omahapython/kaggle-houseprices/issues/1 > > > > > > > > > > REQ (Request): Please reply with your github username if you want > to > > be > > > > > added to the omahapython org and/or the omahapython/datascience > team. > > > > All I > > > > > need is either: > > > > > > > > > > username, omahapython > > > > > > > > > > username, omahapython, @omahapython/datascience > > > > > > > > > > > "Use @omahapython/datascience to mention this team in comments." > > > > > > > > > > The @omahapython/datascience team has write access to > > > kaggle-houseprices > > > > > (where I'll soon create the recommended kaggle competition folder > > > > structure > > > > > compiled in omahapython/datascience#3). 
> > > > > _______________________________________________ > > > > > Omaha Python Users Group mailing list > > > > > Omaha at python.org > > > > > https://mail.python.org/mailman/listinfo/omaha > > > > > http://www.OmahaPython.org > > > > > > > > > _______________________________________________ > > > > Omaha Python Users Group mailing list > > > > Omaha at python.org > > > > https://mail.python.org/mailman/listinfo/omaha > > > > http://www.OmahaPython.org > > > > > > > _______________________________________________ > > > Omaha Python Users Group mailing list > > > Omaha at python.org > > > https://mail.python.org/mailman/listinfo/omaha > > > http://www.OmahaPython.org > > > > > _______________________________________________ > > Omaha Python Users Group mailing list > > Omaha at python.org > > https://mail.python.org/mailman/listinfo/omaha > > http://www.OmahaPython.org > > > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > From bob.haffner at gmail.com Mon Dec 19 22:46:14 2016 From: bob.haffner at gmail.com (Bob Haffner) Date: Mon, 19 Dec 2016 21:46:14 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: Hi All, I submitted another earlier today, but did not improve upon Wes' submission. I used the same feature set from my Saturday submission just tried some different regressors (lasso, elastic and Random Forest Regressor). I also did some cross validation. Link to my latest notebook on github https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices.ipynb Bob On Sun, Dec 18, 2016 at 6:25 PM, Bob Haffner wrote: > Looks like it's not as simple as setting a value to True so I'll let you > sort it out. > > > > On Sun, Dec 18, 2016 at 1:00 PM, Wes Turner wrote: > >> >> On Sun, Dec 18, 2016 at 11:55 AM, Bob Haffner >> wrote: >> >>> Wes, I can try to run your process with do_get_dummies=True. Anything >>> else need to change? >>> >> >> Yup, >> >> https://github.com/westurner/house_prices/blob/2839ff8a/hous >> e_prices/data.py#L94 : >> >> if do_get_dummies: >> def get_categorical_columns(column_categories): >> for colkey in column_categories: >> values = column_categories[colkey] >> if len(values): >> yield colkey >> categorical_columns = list(get_categorical_columns(column_categories)) >> get_dummies_dict = {key: key for key in categorical_columns} >> df = pd.get_dummies(df, prefix=get_dummies_dict, columns=get_dummies_dict) >> >> Needs to also be applied to train_csv and test_csv in the generated and >> modified pipeline: >> https://github.com/westurner/house_prices/blob/2839ff8a/hous >> e_prices/pipelines/tpot_house_prices__001__modified.py#L40 >> >> So, I can either copy/paste or factor or it out: >> >> - copy/paste: just wrong >> - factor it out: >> - this creates a (new) dependency on house_prices from within the >> generated pipeline; which currently depends on [stable versions of] >> (datacleaner, pandas, and scikit-learn) >> >> ... 
TODO: today >> >> - [ ] pd.get_dummies(train_df), pd.get_dummies(test_df) >> - [ ] Dockerfile >> - probably the easiest way to reproduce the environment.yml >> - [ ] automate the __modified.py patching process : >> >> >> # git clone ssh://git at github.com/westurner/house_prices # -b >> develop >> conda env update -f ./environment.yml >> cd house_prices/ >> >> python ./analysis.py >> >> # (wait) >> >> mv ./pipelines/tpot_house_prices_.py \ >> ./pipelines/tpot_house_prices__002.py >> mv ./pipelines/tpot_house_prices_.py.json \ >> ./pipelines/tpot_house_prices__002.py.json >> cp ./pipelines/tpot_house_prices__001__modified.py \ >> ./pipelines/tpot_house_prices__002__modified.py >> # copy/paste (TODO: patch/template): >> # - exported_pipeline / self.exported_pipeline >> # - the sklearn imports] to __002__modified.py >> cd pipelines/ # TODO: settings.json >> python ./tpot_house_prices__002__modified.py >> >> >> ... The modified pipeline generation is not quite reproducible yet, but >> the generated pipeline (tpot_house_prices__001[__modified].py) is. (With >> ~2% error ... only about ~$6mn dollars off :|) >> >> >> >>> Sent from my iPhone >>> >>> On Dec 18, 2016, at 10:59 AM, Wes Turner wrote: >>> >>> Thanks, Bob! >>> >>> On Sun, Dec 18, 2016 at 9:26 AM, Bob Haffner >>> wrote: >>> >>>> Nice job, Wes!! >>>> >>>> On Sun, Dec 18, 2016 at 4:11 AM, Wes Turner >>>> wrote: >>>> >>>>> In addition to posting to the mailing list, I created a comment on the >>>>> "Kaggle Submissions" issue [1]: >>>>> >>>>> - Score: 0.13667 (#1370) >>>>>> - https://www.kaggle.com/c/house-prices-advanced-regression-te >>>>>> chniques/leaderboard?submissionId=3925119 >>>>>> - https://mail.python.org/pipermail/omaha/2016-December/002206.html >>>>>> - https://github.com/westurner/house_prices/blob/2839ff8a/hous >>>>>> e_prices/pipelines/tpot_house_prices__001__modified.py >>>>> >>>>> >>>>> [1] https://github.com/omahapython/kaggle-houseprices/issues/2 >>>>> >>>>> On Sun, Dec 18, 2016 at 3:45 AM, Wes Turner >>>>> wrote: >>>>> >>>>>> Sounds great. 1/18. >>>>>> >>>>>> I just submitted my first submission.csv to Kaggle! [1] >>>>>> >>>>>> $ python ./tpot_house_prices__001__modified.py >>>>>> class_sum: 264144946 >>>>>> abs error: 5582809.288 >>>>>> % error: 2.11354007432 % >>>>>> error**2: 252508654837.0 >>>>>> # python ./tpot_house_prices__001__modified.py >>>>>> >>>>>> >>>>>> ... Which moves us up to #1370! >>>>>> >>>>>> Your Best Entry ? >>>>>> You improved on your best score by 0.02469. >>>>>> You just moved up 608 positions on the leaderboard. 
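The get_dummies fix discussed above comes down to encoding train and test
consistently: either encode them together, or encode them separately and
align the columns afterward. A minimal sketch of the align-afterward
approach (the function name and arguments are illustrative, not code from
Wes's repo, and it assumes the SalePrice target has already been split out
of train_df):

    import pandas as pd

    def encode_with_aligned_dummies(train_df, test_df, categorical_columns):
        # One-hot encode the categorical columns in each frame independently.
        train_enc = pd.get_dummies(train_df, columns=categorical_columns)
        test_enc = pd.get_dummies(test_df, columns=categorical_columns)
        # A category seen in only one frame produces a column the other frame
        # lacks; reindex adds the missing columns filled with 0 and keeps the
        # column order identical, so a model fit on train_enc can score
        # test_enc directly.
        test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
        return train_enc, test_enc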
>>>>>> >>>>>> >>>>>> I have a few more things to try: >>>>>> >>>>>> >>>>>> - Manually drop the 'Id' column >>>>>> - do_get_dummies=True (data.py) + EC2 m4.4xlarge instance >>>>>> - I got an oom error w/ an 8GB notebook (at 25/120 w/ >>>>>> verbosity=2) >>>>>> - https://github.com/westurner/house_prices/blob/2839ff8a/hous >>>>>> e_prices/data.py#L94 >>>>>> - skleanGridSearch and/or sklearn-deap the TPOT hyperparameters >>>>>> - http://scikit-learn.org/stable/modules/generated/sklearn.mod >>>>>> el_selection.GridSearchCV.html#sklearn.model_selection.GridS >>>>>> earchCV >>>>>> >>>>>> - https://github.com/rsteca/sklearn-deap >>>>>> - REF,BLD,DOC,TST: >>>>>> - factor constants out in favor of settings.json and data.py >>>>>> - https://github.com/omahapython >>>>>> /kaggle-houseprices/blob/master/src/data.py >>>>>> >>>>>> - implement train.py and predict.py, too >>>>>> - create a Dockerfile FROM kaggle/docker-python:latest >>>>>> - https://github.com/omahapython/datascience/issues/3 >>>>>> "Kaggle Best Practices" >>>>>> - docstrings, tests >>>>>> - https://github.com/omahapython/datascience/wiki/resources >>>>>> >>>>>> [1] https://github.com/westurner/house_prices/blob/2839ff8a/hous >>>>>> e_prices/pipelines/tpot_house_prices__001__modified.py >>>>>> >>>>>> On Sat, Dec 17, 2016 at 4:39 PM, Bob Haffner via Omaha < >>>>>> omaha at python.org> wrote: >>>>>> >>>>>>> Hey all, regarding our January kaggle meetup that we talked about. >>>>>>> Maybe >>>>>>> we can meet following our regular monthly (1/18). >>>>>>> >>>>>>> Would that be easier/better for everyone? >>>>>>> >>>>>>> On Sat, Dec 17, 2016 at 4:34 PM, Bob Haffner >>>>>>> wrote: >>>>>>> >>>>>>> > Just submitted another Linear Regression attempt (0.16136). Added >>>>>>> some >>>>>>> > features, both numeric and categorical, and created 3 numerics >>>>>>> > >>>>>>> > -TotalFullBaths >>>>>>> > -TotalHalfBaths >>>>>>> > -Pool >>>>>>> > >>>>>>> > Notebook attached >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > On Sat, Dec 17, 2016 at 4:21 PM, Bob Haffner < >>>>>>> bob.haffner at gmail.com> >>>>>>> > wrote: >>>>>>> > >>>>>>> >> Just submitted another Linear Regression attempt (0.16136). >>>>>>> Added some >>>>>>> >> features, both numeric and categorical, and created 3 numerics >>>>>>> >> >>>>>>> >> -TotalFullBaths >>>>>>> >> -TotalHalfBaths >>>>>>> >> -Pool >>>>>>> >> >>>>>>> >> Notebook attached >>>>>>> >> >>>>>>> >> On Sat, Dec 17, 2016 at 3:28 PM, Wes Turner >>>>>>> wrote: >>>>>>> >> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> On Sat, Dec 17, 2016 at 3:25 PM, Wes Turner < >>>>>>> wes.turner at gmail.com> >>>>>>> >>> wrote: >>>>>>> >>> >>>>>>> >>>> >>>>>>> >>>> >>>>>>> >>>> On Sat, Dec 17, 2016 at 2:39 PM, Bob Haffner via Omaha < >>>>>>> >>>> omaha at python.org> wrote: >>>>>>> >>>> >>>>>>> >>>>> >Does Kaggle take the high mark but still give a score for each >>>>>>> >>>>> submission? >>>>>>> >>>>> Yes. >>>>>>> >>>>> https://www.kaggle.com/c/house-prices-advanced-regression-te >>>>>>> >>>>> chniques/submissions >>>>>>> >>>>> >>>>>>> >>>>> >>>>>>> >>>>> >Thinking of ways to keep track of which code produced which >>>>>>> score; >>>>>>> >>>>> I'll >>>>>>> >>>>> >post about the GitHub setup in a bit. >>>>>>> >>>>> We could push our notebooks to the github repo? 
Maybe include >>>>>>> a brief >>>>>>> >>>>> description at the top in a markdown cell >>>>>>> >>>>> >>>>>>> >>>> >>>>>>> >>>> In my research [1], I found that the preferred folder structure >>>>>>> for >>>>>>> >>>> kaggle is input/ (data), src/ (.py, .ipnb notebooks), and >>>>>>> working/ >>>>>>> >>>> (outputs); >>>>>>> >>>> and that they recommend creating a settings.json with path >>>>>>> >>>> configuration (e.g. pointing to input/, src/ data/) >>>>>>> >>>> >>>>>>> >>>> So, we could put notebooks, folders, and repos in src/ [2]. >>>>>>> >>>> >>>>>>> >>>> runipy is a bit more scriptable than requiring notebook gui >>>>>>> >>>> interactions [3]. >>>>>>> >>>> >>>>>>> >>>> We could either hardcode '../input/test.csv' in our .py and >>>>>>> .ipnb >>>>>>> >>>> sources, or we could write a function in src/data.py to read >>>>>>> >>>> '../settings.json' into a dict with the recommended variable >>>>>>> names [1]: >>>>>>> >>>> >>>>>>> >>>> from data import read_settings_json >>>>>>> >>>> settings = read_settings_json() >>>>>>> >>>> train = pd.read_csv(settings['TRAIN_DATA_PATH']) >>>>>>> >>>> # .... >>>>>>> >>>> pd.write_csv(settings['SUBMISSION_PATH']) >>>>>>> >>>> >>>>>>> >>>> [1] https://github.com/omahapython >>>>>>> /datascience/issues/3#issuecom >>>>>>> >>>> ment-267236556 >>>>>>> >>>> [2] https://github.com/omahapython >>>>>>> /kaggle-houseprices/tree/master/src >>>>>>> >>>> [3] https://pypi.python.org/pypi/runipy >>>>>>> >>>> >>>>>>> >>>> >>>>>>> >>>>> >>>>>>> >>>>> I initially thought github was a good way to go, but I don't >>>>>>> know if >>>>>>> >>>>> everyone has a github acct or is interested in starting one. >>>>>>> Maybe >>>>>>> >>>>> email >>>>>>> >>>>> is the way to go? >>>>>>> >>>>> >>>>>>> >>>> >>>>>>> >>>> I'm all for GitHub: >>>>>>> >>>> >>>>>>> >>>> - git source control and revision numbers >>>>>>> >>>> - we're not able to easily share code in the mailing list >>>>>>> >>>> - we can learn from each others' solutions >>>>>>> >>>> >>>>>>> >>> >>>>>>> >>> An example of mailing list limitations: >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> Your mail to 'Omaha' with the subject >>>>>>> >>> >>>>>>> >>> Re: [omaha] Group Data Science Competition >>>>>>> >>> >>>>>>> >>> Is being held until the list moderator can review it for >>>>>>> approval. >>>>>>> >>> >>>>>>> >>> The reason it is being held: >>>>>>> >>> >>>>>>> >>> Message body is too big: 47004 bytes with a limit of 40 KB >>>>>>> >>> >>>>>>> >>> (I trimmed out the reply chain; so this may make it through >>>>>>> first) >>>>>>> >>> >>>>>>> >> >>>>>>> >> >>>>>>> > >>>>>>> _______________________________________________ >>>>>>> Omaha Python Users Group mailing list >>>>>>> Omaha at python.org >>>>>>> https://mail.python.org/mailman/listinfo/omaha >>>>>>> http://www.OmahaPython.org >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> > From bob.haffner at gmail.com Tue Dec 20 21:55:11 2016 From: bob.haffner at gmail.com (Bob Haffner) Date: Tue, 20 Dec 2016 20:55:11 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: Hi All, I put together a quick histogram of the leaderboard showing the distribution of the scores under .25. It's a tight race! https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices_leaderboard.ipynb On Mon, Dec 19, 2016 at 9:46 PM, Bob Haffner wrote: > Hi All, I submitted another earlier today, but did not improve upon Wes' > submission. 
>
> I used the same feature set from my Saturday submission just tried some
> different regressors (lasso, elastic and Random Forest Regressor). I also
> did some cross validation.
>
> Link to my latest notebook on github
> https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices.ipynb
>
> Bob
>
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
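For anyone who wants to reproduce the leaderboard histogram without opening
the notebook, a minimal sketch: it assumes the public leaderboard has been
exported from Kaggle as a CSV with a 'Score' column, and the filename here
is hypothetical.

    import pandas as pd
    import matplotlib.pyplot as plt

    lb = pd.read_csv('house_prices_leaderboard.csv')
    # Scores are "lower is better"; the interesting pack is under 0.25.
    scores = lb.loc[lb['Score'] < 0.25, 'Score']

    # 50 bins over the sub-0.25 range shows how tightly packed it is.
    scores.plot.hist(bins=50)
    plt.xlabel('public leaderboard score')
    plt.ylabel('number of teams')
    plt.title('House Prices public leaderboard, scores under 0.25')
    plt.show()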
>>>>> >>>>> On Sun, Dec 18, 2016 at 4:11 AM, Wes Turner >>>>> wrote: >>>>> >>>>>> In addition to posting to the mailing list, I created a comment on >>>>>> the "Kaggle Submissions" issue [1]: >>>>>> >>>>>> - Score: 0.13667 (#1370) >>>>>>> - https://www.kaggle.com/c/house-prices-advanced-regression-te >>>>>>> chniques/leaderboard?submissionId=3925119 >>>>>>> - https://mail.python.org/pipermail/omaha/2016-December/002206 >>>>>>> .html >>>>>>> - https://github.com/westurner/house_prices/blob/2839ff8a/hous >>>>>>> e_prices/pipelines/tpot_house_prices__001__modified.py >>>>>> >>>>>> >>>>>> [1] https://github.com/omahapython/kaggle-houseprices/issues/2 >>>>>> >>>>>> On Sun, Dec 18, 2016 at 3:45 AM, Wes Turner >>>>>> wrote: >>>>>> >>>>>>> Sounds great. 1/18. >>>>>>> >>>>>>> I just submitted my first submission.csv to Kaggle! [1] >>>>>>> >>>>>>> $ python ./tpot_house_prices__001__modified.py >>>>>>> class_sum: 264144946 >>>>>>> abs error: 5582809.288 >>>>>>> % error: 2.11354007432 % >>>>>>> error**2: 252508654837.0 >>>>>>> # python ./tpot_house_prices__001__modified.py >>>>>>> >>>>>>> >>>>>>> ... Which moves us up to #1370! >>>>>>> >>>>>>> Your Best Entry ? >>>>>>> You improved on your best score by 0.02469. >>>>>>> You just moved up 608 positions on the leaderboard. >>>>>>> >>>>>>> >>>>>>> I have a few more things to try: >>>>>>> >>>>>>> >>>>>>> - Manually drop the 'Id' column >>>>>>> - do_get_dummies=True (data.py) + EC2 m4.4xlarge instance >>>>>>> - I got an oom error w/ an 8GB notebook (at 25/120 w/ >>>>>>> verbosity=2) >>>>>>> - https://github.com/westurner/house_prices/blob/2839ff8a/hous >>>>>>> e_prices/data.py#L94 >>>>>>> - skleanGridSearch and/or sklearn-deap the TPOT >>>>>>> hyperparameters >>>>>>> - http://scikit-learn.org/stable/modules/generated/sklearn.mod >>>>>>> el_selection.GridSearchCV.html#sklearn.model_selection.GridS >>>>>>> earchCV >>>>>>> >>>>>>> - https://github.com/rsteca/sklearn-deap >>>>>>> - REF,BLD,DOC,TST: >>>>>>> - factor constants out in favor of settings.json and data.py >>>>>>> - https://github.com/omahapython >>>>>>> /kaggle-houseprices/blob/master/src/data.py >>>>>>> >>>>>>> - implement train.py and predict.py, too >>>>>>> - create a Dockerfile FROM kaggle/docker-python:latest >>>>>>> - https://github.com/omahapython/datascience/issues/3 >>>>>>> "Kaggle Best Practices" >>>>>>> - docstrings, tests >>>>>>> - https://github.com/omahapython/datascience/wiki/resources >>>>>>> >>>>>>> [1] https://github.com/westurner/house_prices/blob/2839ff8a/hous >>>>>>> e_prices/pipelines/tpot_house_prices__001__modified.py >>>>>>> >>>>>>> On Sat, Dec 17, 2016 at 4:39 PM, Bob Haffner via Omaha < >>>>>>> omaha at python.org> wrote: >>>>>>> >>>>>>>> Hey all, regarding our January kaggle meetup that we talked about. >>>>>>>> Maybe >>>>>>>> we can meet following our regular monthly (1/18). >>>>>>>> >>>>>>>> Would that be easier/better for everyone? >>>>>>>> >>>>>>>> On Sat, Dec 17, 2016 at 4:34 PM, Bob Haffner >>>>>>>> wrote: >>>>>>>> >>>>>>>> > Just submitted another Linear Regression attempt (0.16136). >>>>>>>> Added some >>>>>>>> > features, both numeric and categorical, and created 3 numerics >>>>>>>> > >>>>>>>> > -TotalFullBaths >>>>>>>> > -TotalHalfBaths >>>>>>>> > -Pool >>>>>>>> > >>>>>>>> > Notebook attached >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > On Sat, Dec 17, 2016 at 4:21 PM, Bob Haffner < >>>>>>>> bob.haffner at gmail.com> >>>>>>>> > wrote: >>>>>>>> > >>>>>>>> >> Just submitted another Linear Regression attempt (0.16136). 
>>>>>>>> Added some >>>>>>>> >> features, both numeric and categorical, and created 3 numerics >>>>>>>> >> >>>>>>>> >> -TotalFullBaths >>>>>>>> >> -TotalHalfBaths >>>>>>>> >> -Pool >>>>>>>> >> >>>>>>>> >> Notebook attached >>>>>>>> >> >>>>>>>> >> On Sat, Dec 17, 2016 at 3:28 PM, Wes Turner < >>>>>>>> wes.turner at gmail.com> wrote: >>>>>>>> >> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> On Sat, Dec 17, 2016 at 3:25 PM, Wes Turner < >>>>>>>> wes.turner at gmail.com> >>>>>>>> >>> wrote: >>>>>>>> >>> >>>>>>>> >>>> >>>>>>>> >>>> >>>>>>>> >>>> On Sat, Dec 17, 2016 at 2:39 PM, Bob Haffner via Omaha < >>>>>>>> >>>> omaha at python.org> wrote: >>>>>>>> >>>> >>>>>>>> >>>>> >Does Kaggle take the high mark but still give a score for >>>>>>>> each >>>>>>>> >>>>> submission? >>>>>>>> >>>>> Yes. >>>>>>>> >>>>> https://www.kaggle.com/c/house-prices-advanced-regression-te >>>>>>>> >>>>> chniques/submissions >>>>>>>> >>>>> >>>>>>>> >>>>> >>>>>>>> >>>>> >Thinking of ways to keep track of which code produced which >>>>>>>> score; >>>>>>>> >>>>> I'll >>>>>>>> >>>>> >post about the GitHub setup in a bit. >>>>>>>> >>>>> We could push our notebooks to the github repo? Maybe >>>>>>>> include a brief >>>>>>>> >>>>> description at the top in a markdown cell >>>>>>>> >>>>> >>>>>>>> >>>> >>>>>>>> >>>> In my research [1], I found that the preferred folder >>>>>>>> structure for >>>>>>>> >>>> kaggle is input/ (data), src/ (.py, .ipnb notebooks), and >>>>>>>> working/ >>>>>>>> >>>> (outputs); >>>>>>>> >>>> and that they recommend creating a settings.json with path >>>>>>>> >>>> configuration (e.g. pointing to input/, src/ data/) >>>>>>>> >>>> >>>>>>>> >>>> So, we could put notebooks, folders, and repos in src/ [2]. >>>>>>>> >>>> >>>>>>>> >>>> runipy is a bit more scriptable than requiring notebook gui >>>>>>>> >>>> interactions [3]. >>>>>>>> >>>> >>>>>>>> >>>> We could either hardcode '../input/test.csv' in our .py and >>>>>>>> .ipnb >>>>>>>> >>>> sources, or we could write a function in src/data.py to read >>>>>>>> >>>> '../settings.json' into a dict with the recommended variable >>>>>>>> names [1]: >>>>>>>> >>>> >>>>>>>> >>>> from data import read_settings_json >>>>>>>> >>>> settings = read_settings_json() >>>>>>>> >>>> train = pd.read_csv(settings['TRAIN_DATA_PATH']) >>>>>>>> >>>> # .... >>>>>>>> >>>> pd.write_csv(settings['SUBMISSION_PATH']) >>>>>>>> >>>> >>>>>>>> >>>> [1] https://github.com/omahapython >>>>>>>> /datascience/issues/3#issuecom >>>>>>>> >>>> ment-267236556 >>>>>>>> >>>> [2] https://github.com/omahapython >>>>>>>> /kaggle-houseprices/tree/master/src >>>>>>>> >>>> [3] https://pypi.python.org/pypi/runipy >>>>>>>> >>>> >>>>>>>> >>>> >>>>>>>> >>>>> >>>>>>>> >>>>> I initially thought github was a good way to go, but I don't >>>>>>>> know if >>>>>>>> >>>>> everyone has a github acct or is interested in starting one. >>>>>>>> Maybe >>>>>>>> >>>>> email >>>>>>>> >>>>> is the way to go? >>>>>>>> >>>>> >>>>>>>> >>>> >>>>>>>> >>>> I'm all for GitHub: >>>>>>>> >>>> >>>>>>>> >>>> - git source control and revision numbers >>>>>>>> >>>> - we're not able to easily share code in the mailing list >>>>>>>> >>>> - we can learn from each others' solutions >>>>>>>> >>>> >>>>>>>> >>> >>>>>>>> >>> An example of mailing list limitations: >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> Your mail to 'Omaha' with the subject >>>>>>>> >>> >>>>>>>> >>> Re: [omaha] Group Data Science Competition >>>>>>>> >>> >>>>>>>> >>> Is being held until the list moderator can review it for >>>>>>>> approval. 
>>>>>>>> >>> >>>>>>>> >>> The reason it is being held: >>>>>>>> >>> >>>>>>>> >>> Message body is too big: 47004 bytes with a limit of 40 KB >>>>>>>> >>> >>>>>>>> >>> (I trimmed out the reply chain; so this may make it through >>>>>>>> first) >>>>>>>> >>> >>>>>>>> >> >>>>>>>> >> >>>>>>>> > >>>>>>>> _______________________________________________ >>>>>>>> Omaha Python Users Group mailing list >>>>>>>> Omaha at python.org >>>>>>>> https://mail.python.org/mailman/listinfo/omaha >>>>>>>> http://www.OmahaPython.org >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > From jeffh at dundeemt.com Wed Dec 21 08:14:34 2016 From: jeffh at dundeemt.com (Jeff Hinrichs - DM&T) Date: Wed, 21 Dec 2016 07:14:34 -0600 Subject: [omaha] github.com/omahapython and @omahapython/datascience REQ: for usernames In-Reply-To: References: Message-ID: username: dundeemt On Sat, Dec 17, 2016 at 2:24 PM, Wes Turner via Omaha wrote: > I've created: > > - an omahapython github organization account: > https://github.com/omahapython > > - an @omahapython/datascience team: > https://github.com/orgs/omahapython/teams/datascience > > - an omahapython/datascience repository > https://github.com/omahapython/datascience- > > - omahapython/datascience#3: "Kaggle Best Practices" > https://github.com/omahapython/datascience/issues/3 > > - an omahapython/kaggle-houseprices repository > https://github.com/omahapython/kaggle-houseprices > > - omahapython/kaggle-houseprices#1: "kaggle-houseprices #1" > https://github.com/omahapython/kaggle-houseprices/issues/1 > > REQ (Request): Please reply with your github username if you want to be > added to the omahapython org and/or the omahapython/datascience team. All I > need is either: > > username, omahapython > > username, omahapython, @omahapython/datascience > > > "Use @omahapython/datascience to mention this team in comments." > > The @omahapython/datascience team has write access to kaggle-houseprices > (where I'll soon create the recommended kaggle competition folder structure > compiled in omahapython/datascience#3). > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > -- Best, Jeff Hinrichs 402.218.1473 From luke.schollmeyer at gmail.com Wed Dec 21 08:31:43 2016 From: luke.schollmeyer at gmail.com (Luke Schollmeyer) Date: Wed, 21 Dec 2016 07:31:43 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: Nice. New submission this morning moved us up 300 spots. On Tue, Dec 20, 2016 at 8:55 PM, Bob Haffner via Omaha wrote: > Hi All, > > I put together a quick histogram of the leaderboard showing the > distribution of the scores under .25. It's a tight race! > https://github.com/bobhaffner/kaggle-houseprices/blob/ > master/kaggle_house_prices_leaderboard.ipynb > > On Mon, Dec 19, 2016 at 9:46 PM, Bob Haffner > wrote: > > > Hi All, I submitted another earlier today, but did not improve upon Wes' > > submission. > > > > I used the same feature set from my Saturday submission just tried some > > different regressors (lasso, elastic and Random Forest Regressor). I > also > > did some cross validation. 
From wes.turner at gmail.com  Wed Dec 21 11:51:47 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Wed, 21 Dec 2016 10:51:47 -0600
Subject: [omaha] github.com/omahapython and @omahapython/datascience REQ:
 for usernames
In-Reply-To:
References:
Message-ID:

On Wednesday, December 21, 2016, Jeff Hinrichs - DM&T <jeffh at dundeemt.com>
wrote:

> username: dundeemt
>

Would you also like to be in the omahapython/datascience team?
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
From jeffh at delasco.com  Wed Dec 21 11:57:34 2016
From: jeffh at delasco.com (Jeff Hinrichs)
Date: Wed, 21 Dec 2016 10:57:34 -0600
Subject: [omaha] github.com/omahapython and @omahapython/datascience REQ:
 for usernames
In-Reply-To:
References:
Message-ID:

Yes, please and thank-you

On Dec 21, 2016 10:52 AM, "Wes Turner via Omaha" <omaha at python.org> wrote:

> Would you also like to be in the omahapython/datascience team?
>
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From wes.turner at gmail.com  Wed Dec 21 12:09:42 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Wed, 21 Dec 2016 11:09:42 -0600
Subject: [omaha] github.com/omahapython and @omahapython/datascience REQ:
 for usernames
In-Reply-To:
References:
Message-ID:

yw. invitations sent.

On Wed, Dec 21, 2016 at 10:57 AM, Jeff Hinrichs via Omaha <omaha at python.org>
wrote:

> Yes, please and thank-you
>
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
From bob.haffner at gmail.com  Wed Dec 21 12:11:45 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Wed, 21 Dec 2016 11:11:45 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Way to go, Luke!!!!
https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices_leaderboard.ipynb

On Wed, Dec 21, 2016 at 7:31 AM, Luke Schollmeyer via Omaha <omaha at python.org>
wrote:

> Nice. New submission this morning moved us up 300 spots.
>
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
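For context on the scores in that histogram: per the competition's
evaluation page, submissions are scored on root-mean-squared error between
the log of the predicted and the log of the actual sale prices (worth
double-checking there; this sketch just shows the arithmetic):

    import numpy as np

    def rmse_log(y_true, y_pred):
        # RMSE between log(actual) and log(predicted); equal percentage
        # errors on cheap and expensive houses are penalized equally.
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

    # Predictions that are uniformly ~10% high score about 0.095.
    print(rmse_log([200000, 150000], [220000, 165000]))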
From wes.turner at gmail.com  Wed Dec 21 11:52:27 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Wed, 21 Dec 2016 10:52:27 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

On Wednesday, December 21, 2016, Luke Schollmeyer via Omaha <
omaha at python.org> wrote:

> Nice. New submission this morning moved us up 300 spots.

Nice work. How'd you do it?

> On Tue, Dec 20, 2016 at 8:55 PM, Bob Haffner via Omaha <omaha at python.org>
> wrote:
>
> > Hi All,
> >
> > I put together a quick histogram of the leaderboard showing the
> > distribution of the scores under .25. It's a tight race!
> > https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices_leaderboard.ipynb
> >
> > On Mon, Dec 19, 2016 at 9:46 PM, Bob Haffner <bob.haffner at gmail.com>
> > wrote:
> >
> > > Hi All, I submitted another earlier today, but did not improve upon
> > > Wes' submission.
> > >
> > > I used the same feature set from my Saturday submission, just tried
> > > some different regressors (lasso, elastic and Random Forest
> > > Regressor). I also did some cross validation.
> > >
> > > Link to my latest notebook on github
> > > https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices.ipynb
> > >
> > > Bob
> > >
> > > On Sun, Dec 18, 2016 at 6:25 PM, Bob Haffner <bob.haffner at gmail.com>
> > > wrote:
> > >
> > >> Looks like it's not as simple as setting a value to True so I'll let
> > >> you sort it out.
> > >>
> > >> On Sun, Dec 18, 2016 at 1:00 PM, Wes Turner <wes.turner at gmail.com> wrote:
> > >>
> > >>> On Sun, Dec 18, 2016 at 11:55 AM, Bob Haffner <bob.haffner at gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Wes, I can try to run your process with do_get_dummies=True.
> > >>>> Anything else need to change?
> > >>>
> > >>> Yup,
> > >>>
> > >>> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L94 :
> > >>>
> > >>>     if do_get_dummies:
> > >>>         def get_categorical_columns(column_categories):
> > >>>             for colkey in column_categories:
> > >>>                 values = column_categories[colkey]
> > >>>                 if len(values):
> > >>>                     yield colkey
> > >>>         categorical_columns = list(get_categorical_columns(column_categories))
> > >>>         get_dummies_dict = {key: key for key in categorical_columns}
> > >>>         df = pd.get_dummies(df, prefix=get_dummies_dict,
> > >>>                             columns=get_dummies_dict)
> > >>>
> > >>> Needs to also be applied to train_csv and test_csv in the generated
> > >>> and modified pipeline:
> > >>> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py#L40
> > >>>
> > >>> So, I can either copy/paste or factor it out:
> > >>>
> > >>> - copy/paste: just wrong
> > >>> - factor it out:
> > >>>   - this creates a (new) dependency on house_prices from within the
> > >>>     generated pipeline; which currently depends on [stable versions
> > >>>     of] (datacleaner, pandas, and scikit-learn)
> > >>>
> > >>> ... TODO: today
> > >>>
> > >>> - [ ] pd.get_dummies(train_df), pd.get_dummies(test_df)
> > >>> - [ ] Dockerfile
> > >>>   - probably the easiest way to reproduce the environment.yml
> > >>> - [ ] automate the __modified.py patching process:
> > >>>
> > >>>     # git clone ssh://git at github.com/westurner/house_prices  # -b develop
> > >>>     conda env update -f ./environment.yml
> > >>>     cd house_prices/
> > >>>
> > >>>     python ./analysis.py
> > >>>
> > >>>     # (wait)
> > >>>
> > >>>     mv ./pipelines/tpot_house_prices_.py \
> > >>>        ./pipelines/tpot_house_prices__002.py
> > >>>     mv ./pipelines/tpot_house_prices_.py.json \
> > >>>        ./pipelines/tpot_house_prices__002.py.json
> > >>>     cp ./pipelines/tpot_house_prices__001__modified.py \
> > >>>        ./pipelines/tpot_house_prices__002__modified.py
> > >>>     # copy/paste (TODO: patch/template):
> > >>>     # - exported_pipeline / self.exported_pipeline
> > >>>     # - the sklearn imports] to __002__modified.py
> > >>>     cd pipelines/  # TODO: settings.json
> > >>>     python ./tpot_house_prices__002__modified.py
> > >>>
> > >>> ... The modified pipeline generation is not quite reproducible yet,
> > >>> but the generated pipeline (tpot_house_prices__001[__modified].py) is.
> > >>> (With ~2% error ... only about ~$6mn dollars off :|)
> > >>>
> > >>>> Sent from my iPhone
> > >>>>
> > >>>> On Dec 18, 2016, at 10:59 AM, Wes Turner <wes.turner at gmail.com> wrote:
> > >>>>
> > >>>> Thanks, Bob!
> > >>>>
> > >>>> On Sun, Dec 18, 2016 at 9:26 AM, Bob Haffner <bob.haffner at gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Nice job, Wes!!
> > >>>>>
> > >>>>> On Sun, Dec 18, 2016 at 4:11 AM, Wes Turner <wes.turner at gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> In addition to posting to the mailing list, I created a comment on
> > >>>>>> the "Kaggle Submissions" issue [1]:
> > >>>>>>
> > >>>>>> - Score: 0.13667 (#1370)
> > >>>>>>   - https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3925119
> > >>>>>>   - https://mail.python.org/pipermail/omaha/2016-December/002206.html
> > >>>>>>   - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py
> > >>>>>>
> > >>>>>> [1] https://github.com/omahapython/kaggle-houseprices/issues/2
> > >>>>>>
> > >>>>>> On Sun, Dec 18, 2016 at 3:45 AM, Wes Turner <wes.turner at gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Sounds great. 1/18.
> > >>>>>>>
> > >>>>>>> I just submitted my first submission.csv to Kaggle! [1]
> > >>>>>>>
> > >>>>>>>     $ python ./tpot_house_prices__001__modified.py
> > >>>>>>>     class_sum: 264144946
> > >>>>>>>     abs error: 5582809.288
> > >>>>>>>     % error:   2.11354007432 %
> > >>>>>>>     error**2:  252508654837.0
> > >>>>>>>
> > >>>>>>> ... Which moves us up to #1370!
> > >>>>>>>
> > >>>>>>>     Your Best Entry
> > >>>>>>>     You improved on your best score by 0.02469.
> > >>>>>>>     You just moved up 608 positions on the leaderboard.
> > >>>>>>>
> > >>>>>>> I have a few more things to try:
> > >>>>>>>
> > >>>>>>> - Manually drop the 'Id' column
> > >>>>>>> - do_get_dummies=True (data.py) + EC2 m4.4xlarge instance
> > >>>>>>>   - I got an oom error w/ an 8GB notebook (at 25/120 w/ verbosity=2)
> > >>>>>>>   - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L94
> > >>>>>>> - sklearn GridSearchCV and/or sklearn-deap the TPOT hyperparameters
> > >>>>>>>   - http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
> > >>>>>>>   - https://github.com/rsteca/sklearn-deap
> > >>>>>>> - REF,BLD,DOC,TST:
> > >>>>>>>   - factor constants out in favor of settings.json and data.py
> > >>>>>>>     - https://github.com/omahapython/kaggle-houseprices/blob/master/src/data.py
> > >>>>>>>   - implement train.py and predict.py, too
> > >>>>>>>   - create a Dockerfile FROM kaggle/docker-python:latest
> > >>>>>>>     - https://github.com/omahapython/datascience/issues/3
> > >>>>>>>       "Kaggle Best Practices"
> > >>>>>>>   - docstrings, tests
> > >>>>>>>   - https://github.com/omahapython/datascience/wiki/resources
> > >>>>>>>
> > >>>>>>> [1] https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py
> > >>>>>>>
> > >>>>>>> On Sat, Dec 17, 2016 at 4:39 PM, Bob Haffner via Omaha <
> > >>>>>>> omaha at python.org> wrote:
> > >>>>>>>
> > >>>>>>>> Hey all, regarding our January kaggle meetup that we talked
> > >>>>>>>> about. Maybe we can meet following our regular monthly (1/18).
> > >>>>>>>>
> > >>>>>>>> Would that be easier/better for everyone?
> > >>>>>>>>
> > >>>>>>>> On Sat, Dec 17, 2016 at 4:21 PM, Bob Haffner <
> > >>>>>>>> bob.haffner at gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>> > Just submitted another Linear Regression attempt (0.16136).
> > >>>>>>>> > Added some features, both numeric and categorical, and
> > >>>>>>>> > created 3 numerics
> > >>>>>>>> >
> > >>>>>>>> > -TotalFullBaths
> > >>>>>>>> > -TotalHalfBaths
> > >>>>>>>> > -Pool
> > >>>>>>>> >
> > >>>>>>>> > Notebook attached
> > >>>>>>>> >
> > >>>>>>>> > On Sat, Dec 17, 2016 at 3:28 PM, Wes Turner <
> > >>>>>>>> > wes.turner at gmail.com> wrote:
> > >>>>>>>> >
> > >>>>>>>> >> On Sat, Dec 17, 2016 at 2:39 PM, Bob Haffner via Omaha <
> > >>>>>>>> >> omaha at python.org> wrote:
> > >>>>>>>> >>
> > >>>>>>>> >>> >Does Kaggle take the high mark but still give a score for
> > >>>>>>>> >>> >each submission?
> > >>>>>>>> >>> Yes.
> > >>>>>>>> >>> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/submissions
> > >>>>>>>> >>>
> > >>>>>>>> >>> >Thinking of ways to keep track of which code produced which
> > >>>>>>>> >>> >score; I'll post about the GitHub setup in a bit.
> > >>>>>>>> >>> We could push our notebooks to the github repo? Maybe
> > >>>>>>>> >>> include a brief description at the top in a markdown cell
> > >>>>>>>> >>
> > >>>>>>>> >> In my research [1], I found that the preferred folder
> > >>>>>>>> >> structure for kaggle is input/ (data), src/ (.py, .ipynb
> > >>>>>>>> >> notebooks), and working/ (outputs);
> > >>>>>>>> >> and that they recommend creating a settings.json with path
> > >>>>>>>> >> configuration (e.g. pointing to input/, src/, data/)
> > >>>>>>>> >>
> > >>>>>>>> >> So, we could put notebooks, folders, and repos in src/ [2].
> > >>>>>>>> >>
> > >>>>>>>> >> runipy is a bit more scriptable than requiring notebook gui
> > >>>>>>>> >> interactions [3].
> > >>>>>>>> >>
> > >>>>>>>> >> We could either hardcode '../input/test.csv' in our .py and
> > >>>>>>>> >> .ipynb sources, or we could write a function in src/data.py
> > >>>>>>>> >> to read '../settings.json' into a dict with the recommended
> > >>>>>>>> >> variable names [1]:
> > >>>>>>>> >>
> > >>>>>>>> >>     from data import read_settings_json
> > >>>>>>>> >>     settings = read_settings_json()
> > >>>>>>>> >>     train = pd.read_csv(settings['TRAIN_DATA_PATH'])
> > >>>>>>>> >>     # ....
> > >>>>>>>> >>     submission.to_csv(settings['SUBMISSION_PATH'])
> > >>>>>>>> >>
> > >>>>>>>> >> [1] https://github.com/omahapython/datascience/issues/3#issuecomment-267236556
> > >>>>>>>> >> [2] https://github.com/omahapython/kaggle-houseprices/tree/master/src
> > >>>>>>>> >> [3] https://pypi.python.org/pypi/runipy
> > >>>>>>>> >>
> > >>>>>>>> >>> I initially thought github was a good way to go, but I don't
> > >>>>>>>> >>> know if everyone has a github acct or is interested in
> > >>>>>>>> >>> starting one. Maybe email is the way to go?
> > >>>>>>>> >>
> > >>>>>>>> >> I'm all for GitHub:
> > >>>>>>>> >>
> > >>>>>>>> >> - git source control and revision numbers
> > >>>>>>>> >> - we're not able to easily share code in the mailing list
> > >>>>>>>> >> - we can learn from each other's solutions
> > >>>>>>>> >>
> > >>>>>>>> >> An example of mailing list limitations:
> > >>>>>>>> >>
> > >>>>>>>> >>     Your mail to 'Omaha' with the subject
> > >>>>>>>> >>
> > >>>>>>>> >>         Re: [omaha] Group Data Science Competition
> > >>>>>>>> >>
> > >>>>>>>> >>     Is being held until the list moderator can review it for
> > >>>>>>>> >>     approval.
> > >>>>>>>> >>
> > >>>>>>>> >>     The reason it is being held:
> > >>>>>>>> >>
> > >>>>>>>> >>         Message body is too big: 47004 bytes with a limit of 40 KB
> > >>>>>>>> >>
> > >>>>>>>> >> (I trimmed out the reply chain; so this may make it through first)
> > >>>>>>>>
> > >>>>>>>> _______________________________________________
> > >>>>>>>> Omaha Python Users Group mailing list
> > >>>>>>>> Omaha at python.org
> > >>>>>>>> https://mail.python.org/mailman/listinfo/omaha
> > >>>>>>>> http://www.OmahaPython.org
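A minimal sketch of the read_settings_json helper described in the quoted
snippet above. Treat this as an illustration rather than the repo's actual
src/data.py; the key names (TRAIN_DATA_PATH, SUBMISSION_PATH) and the
default '../settings.json' path follow the Kaggle convention cited there:

    import json
    import pandas as pd

    def read_settings_json(path='../settings.json'):
        """Read the Kaggle settings.json path configuration into a dict."""
        with open(path) as f:
            return json.load(f)

    # Usage, following the snippet quoted above:
    # settings = read_settings_json()
    # train = pd.read_csv(settings['TRAIN_DATA_PATH'])
    # ...
    # submission.to_csv(settings['SUBMISSION_PATH'], index=False)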
From wes.turner at gmail.com  Wed Dec 21 15:11:23 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Wed, 21 Dec 2016 14:11:23 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner <wes.turner at gmail.com> wrote:

> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer <
> luke.schollmeyer at gmail.com> wrote:
>
>> The quick explanation is rather than dropping outliers, I used numpy's
>> log1p function to help normalize distribution of the data (for both the
>> sale price and for all features over a certain skewness). I was also
>> struggling with adding in more features to the model.
>
> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html
> - http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
>
> https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations
>
> https://en.wikipedia.org/wiki/Log-normal_distribution
>
> How did you determine the skewness threshold?
>
> ...
>
> https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Specified_variance:_the_normal_distribution
>
> https://en.wikipedia.org/wiki/Normalization_(statistics)
>
> http://scikit-learn.org/stable/modules/preprocessing.html#normalization

- https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network
- https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks

(Trimmed reply-chain (again) because 40Kb limit)
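As a concrete illustration of the log1p approach under discussion, a sketch
that log-transforms the target and any numeric feature above a skewness
cutoff. The 0.75 cutoff is an assumed, illustrative value (the question
above asks how it was chosen), and '../input/train.csv' follows the folder
convention from earlier in the thread:

    import numpy as np
    import pandas as pd

    train = pd.read_csv('../input/train.csv', index_col='Id')

    # SalePrice is right-skewed; model log1p(SalePrice) instead.
    y = np.log1p(train['SalePrice'])

    # Apply log1p to numeric features whose skewness exceeds the cutoff.
    # (log1p assumes non-negative values, which holds for these features.)
    numeric_cols = train.select_dtypes(include=[np.number]).columns.drop('SalePrice')
    skewness = train[numeric_cols].skew()
    skewed_cols = skewness[skewness > 0.75].index  # assumed cutoff
    train[skewed_cols] = np.log1p(train[skewed_cols])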
From wes.turner at gmail.com  Wed Dec 21 15:14:34 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Wed, 21 Dec 2016 14:14:34 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner <wes.turner at gmail.com> wrote:

> - https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network
> - https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn

>> Why is the concatenation necessary?
>> - log1p doesn't need the whole column
>> - get_dummies doesn't need the whole column

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
( http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler ) > >> >>> >>> >>> Luke >>> >>> >>> >> > (Trimmed reply-chain (again) because 40Kb limit) > > From luke.schollmeyer at gmail.com Wed Dec 21 14:06:34 2016 From: luke.schollmeyer at gmail.com (Luke Schollmeyer) Date: Wed, 21 Dec 2016 13:06:34 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: The quick explanation is rather than dropping outliers, I used numpy's log1p function to help normalize distribution of the data (for both the sale price and for all features over a certain skewness). I was also struggling with adding in more features to the model. The training and test data sets have different "completeness" of some features, and using pd.get_dummies can be problematic when you fit a model versus predicting if you don't have the same columns/features. I simply combined the train and test data sets (without the Id and SalePrice) and ran the get_dummies function over that set. When I needed to fit the model, I just "unraveled" the combined set with the train and test parts. combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'], test.loc[:,'MSSubClass':'SaleCondition'])) combined = pd.get_dummies(combined) ::: do some feature engineering ::: trainX = combined[:train.shape[0]] y = train['SalePrice'] Just so long you don't do anything to the combined dataframe (like sorting), you can slice off each part based on it's shape. and when you would be pulling the data to predict the test data, you get the other part: testX = combined[train.shape[0]:] Luke On Wed, Dec 21, 2016 at 10:52 AM, Wes Turner wrote: > > > On Wednesday, December 21, 2016, Luke Schollmeyer via Omaha < > omaha at python.org> wrote: > >> Nice. New submission this morning moved us up 300 spots. > > > Nice work. How'd you do it? > > >> >> On Tue, Dec 20, 2016 at 8:55 PM, Bob Haffner via Omaha >> wrote: >> >> > Hi All, >> > >> > I put together a quick histogram of the leaderboard showing the >> > distribution of the scores under .25. It's a tight race! >> > https://github.com/bobhaffner/kaggle-houseprices/blob/ >> > master/kaggle_house_prices_leaderboard.ipynb >> > >> > On Mon, Dec 19, 2016 at 9:46 PM, Bob Haffner >> > wrote: >> > >> > > Hi All, I submitted another earlier today, but did not improve upon >> Wes' >> > > submission. >> > > >> > > I used the same feature set from my Saturday submission just tried >> some >> > > different regressors (lasso, elastic and Random Forest Regressor). I >> > also >> > > did some cross validation. >> > > >> > > Link to my latest notebook on github >> > > https://github.com/bobhaffner/kaggle-houseprices/blob/ >> > > master/kaggle_house_prices.ipynb >> > > >> > > Bob >> > > >> > > On Sun, Dec 18, 2016 at 6:25 PM, Bob Haffner >> > > wrote: >> > > >> > >> Looks like it's not as simple as setting a value to True so I'll let >> you >> > >> sort it out. >> > >> >> > >> >> > >> >> > >> On Sun, Dec 18, 2016 at 1:00 PM, Wes Turner >> > wrote: >> > >> >> > >>> >> > >>> On Sun, Dec 18, 2016 at 11:55 AM, Bob Haffner < >> bob.haffner at gmail.com> >> > >>> wrote: >> > >>> >> > >>>> Wes, I can try to run your process with do_get_dummies=True. >> Anything >> > >>>> else need to change? 
>> > >>>> >> > >>> >> > >>> Yup, >> > >>> >> > >>> https://github.com/westurner/house_prices/blob/2839ff8a/hous >> > >>> e_prices/data.py#L94 : >> > >>> >> > >>> if do_get_dummies: >> > >>> def get_categorical_columns(column_categories): >> > >>> for colkey in column_categories: >> > >>> values = column_categories[colkey] >> > >>> if len(values): >> > >>> yield colkey >> > >>> categorical_columns = list(get_categorical_columns( >> > column_categories)) >> > >>> get_dummies_dict = {key: key for key in categorical_columns} >> > >>> df = pd.get_dummies(df, prefix=get_dummies_dict, >> > columns=get_dummies_dict) >> > >>> >> > >>> Needs to also be applied to train_csv and test_csv in the generated >> and >> > >>> modified pipeline: >> > >>> https://github.com/westurner/house_prices/blob/2839ff8a/hous >> > >>> e_prices/pipelines/tpot_house_prices__001__modified.py#L40 >> > >>> >> > >>> So, I can either copy/paste or factor or it out: >> > >>> >> > >>> - copy/paste: just wrong >> > >>> - factor it out: >> > >>> - this creates a (new) dependency on house_prices from within the >> > >>> generated pipeline; which currently depends on [stable versions of] >> > >>> (datacleaner, pandas, and scikit-learn) >> > >>> >> > >>> ... TODO: today >> > >>> >> > >>> - [ ] pd.get_dummies(train_df), pd.get_dummies(test_df) >> > >>> - [ ] Dockerfile >> > >>> - probably the easiest way to reproduce the environment.yml >> > >>> - [ ] automate the __modified.py patching process : >> > >>> >> > >>> >> > >>> # git clone ssh://git at github.com/westurner/house_prices >> # >> > >>> -b develop >> > >>> conda env update -f ./environment.yml >> > >>> cd house_prices/ >> > >>> >> > >>> python ./analysis.py >> > >>> >> > >>> # (wait) >> > >>> >> > >>> mv ./pipelines/tpot_house_prices_.py \ >> > >>> ./pipelines/tpot_house_prices__002.py >> > >>> mv ./pipelines/tpot_house_prices_.py.json \ >> > >>> ./pipelines/tpot_house_prices__002.py.json >> > >>> cp ./pipelines/tpot_house_prices__001__modified.py \ >> > >>> ./pipelines/tpot_house_prices__002__modified.py >> > >>> # copy/paste (TODO: patch/template): >> > >>> # - exported_pipeline / self.exported_pipeline >> > >>> # - the sklearn imports] to __002__modified.py >> > >>> cd pipelines/ # TODO: settings.json >> > >>> python ./tpot_house_prices__002__modified.py >> > >>> >> > >>> >> > >>> ... The modified pipeline generation is not quite reproducible yet, >> but >> > >>> the generated pipeline (tpot_house_prices__001[__modified].py) is. >> > >>> (With ~2% error ... only about ~$6mn dollars off :|) >> > >>> >> > >>> >> > >>> >> > >>>> Sent from my iPhone >> > >>>> >> > >>>> On Dec 18, 2016, at 10:59 AM, Wes Turner >> > wrote: >> > >>>> >> > >>>> Thanks, Bob! >> > >>>> >> > >>>> On Sun, Dec 18, 2016 at 9:26 AM, Bob Haffner < >> bob.haffner at gmail.com> >> > >>>> wrote: >> > >>>> >> > >>>>> Nice job, Wes!! 
>> > >>>>> >> > >>>>> On Sun, Dec 18, 2016 at 4:11 AM, Wes Turner > > >> > >>>>> wrote: >> > >>>>> >> > >>>>>> In addition to posting to the mailing list, I created a comment >> on >> > >>>>>> the "Kaggle Submissions" issue [1]: >> > >>>>>> >> > >>>>>> - Score: 0.13667 (#1370) >> > >>>>>>> - https://www.kaggle.com/c/house >> -prices-advanced-regression-te >> > >>>>>>> chniques/leaderboard?submissionId=3925119 >> > >>>>>>> - https://mail.python.org/piperm >> ail/omaha/2016-December/002206 >> > >>>>>>> .html >> > >>>>>>> - https://github.com/westurner/h >> ouse_prices/blob/2839ff8a/hous >> > >>>>>>> e_prices/pipelines/tpot_house_prices__001__modified.py >> > >>>>>> >> > >>>>>> >> > >>>>>> [1] https://github.com/omahapython/kaggle-houseprices/issues/2 >> > >>>>>> >> > >>>>>> On Sun, Dec 18, 2016 at 3:45 AM, Wes Turner < >> wes.turner at gmail.com> >> > >>>>>> wrote: >> > >>>>>> >> > >>>>>>> Sounds great. 1/18. >> > >>>>>>> >> > >>>>>>> I just submitted my first submission.csv to Kaggle! [1] >> > >>>>>>> >> > >>>>>>> $ python ./tpot_house_prices__001__modified.py >> > >>>>>>> class_sum: 264144946 >> > >>>>>>> abs error: 5582809.288 >> > >>>>>>> % error: 2.11354007432 % >> > >>>>>>> error**2: 252508654837.0 >> > >>>>>>> # python ./tpot_house_prices__001__modified.py >> > >>>>>>> >> > >>>>>>> >> > >>>>>>> ... Which moves us up to #1370! >> > >>>>>>> >> > >>>>>>> Your Best Entry ? >> > >>>>>>> You improved on your best score by 0.02469. >> > >>>>>>> You just moved up 608 positions on the leaderboard. >> > >>>>>>> >> > >>>>>>> >> > >>>>>>> I have a few more things to try: >> > >>>>>>> >> > >>>>>>> >> > >>>>>>> - Manually drop the 'Id' column >> > >>>>>>> - do_get_dummies=True (data.py) + EC2 m4.4xlarge instance >> > >>>>>>> - I got an oom error w/ an 8GB notebook (at 25/120 w/ >> > >>>>>>> verbosity=2) >> > >>>>>>> - https://github.com/westurner/h >> ouse_prices/blob/2839ff8a/ >> > hous >> > >>>>>>> e_prices/data.py#L94 >> > >>>>>>> - skleanGridSearch and/or sklearn-deap the TPOT >> > >>>>>>> hyperparameters >> > >>>>>>> - http://scikit-learn.org/stable/modules/generated/ >> > sklearn.mod >> > >>>>>>> el_selection.GridSearchCV.htm >> l#sklearn.model_selection.GridS >> > >>>>>>> earchCV >> > >>>>>>> > > sklearn.model_selection.GridSearchCV.html#sklearn. >> > model_selection.GridSearchCV> >> > >>>>>>> - https://github.com/rsteca/sklearn-deap >> > >>>>>>> - REF,BLD,DOC,TST: >> > >>>>>>> - factor constants out in favor of settings.json and >> data.py >> > >>>>>>> - https://github.com/omahapython >> > >>>>>>> /kaggle-houseprices/blob/master/src/data.py >> > >>>>>>> > > houseprices/blob/master/src/data.py> >> > >>>>>>> - implement train.py and predict.py, too >> > >>>>>>> - create a Dockerfile FROM kaggle/docker-python:latest >> > >>>>>>> - https://github.com/omahapython/datascience/issues/3 >> > >>>>>>> "Kaggle Best Practices" >> > >>>>>>> - docstrings, tests >> > >>>>>>> - https://github.com/omahapython/datascience/wiki/resources >> > >>>>>>> >> > >>>>>>> [1] https://github.com/westurner/h >> ouse_prices/blob/2839ff8a/hous >> > >>>>>>> e_prices/pipelines/tpot_house_prices__001__modified.py >> > >>>>>>> >> > >>>>>>> On Sat, Dec 17, 2016 at 4:39 PM, Bob Haffner via Omaha < >> > >>>>>>> omaha at python.org> wrote: >> > >>>>>>> >> > >>>>>>>> Hey all, regarding our January kaggle meetup that we talked >> about. >> > >>>>>>>> Maybe >> > >>>>>>>> we can meet following our regular monthly (1/18). >> > >>>>>>>> >> > >>>>>>>> Would that be easier/better for everyone? 
>> > >>>>>>>> >> > >>>>>>>> On Sat, Dec 17, 2016 at 4:34 PM, Bob Haffner < >> > bob.haffner at gmail.com> >> > >>>>>>>> wrote: >> > >>>>>>>> >> > >>>>>>>> > Just submitted another Linear Regression attempt (0.16136). >> > >>>>>>>> Added some >> > >>>>>>>> > features, both numeric and categorical, and created 3 >> numerics >> > >>>>>>>> > >> > >>>>>>>> > -TotalFullBaths >> > >>>>>>>> > -TotalHalfBaths >> > >>>>>>>> > -Pool >> > >>>>>>>> > >> > >>>>>>>> > Notebook attached >> > >>>>>>>> > >> > >>>>>>>> > >> > >>>>>>>> > >> > >>>>>>>> > On Sat, Dec 17, 2016 at 4:21 PM, Bob Haffner < >> > >>>>>>>> bob.haffner at gmail.com> >> > >>>>>>>> > wrote: >> > >>>>>>>> > >> > >>>>>>>> >> Just submitted another Linear Regression attempt (0.16136). >> > >>>>>>>> Added some >> > >>>>>>>> >> features, both numeric and categorical, and created 3 >> numerics >> > >>>>>>>> >> >> > >>>>>>>> >> -TotalFullBaths >> > >>>>>>>> >> -TotalHalfBaths >> > >>>>>>>> >> -Pool >> > >>>>>>>> >> >> > >>>>>>>> >> Notebook attached >> > >>>>>>>> >> >> > >>>>>>>> >> On Sat, Dec 17, 2016 at 3:28 PM, Wes Turner < >> > >>>>>>>> wes.turner at gmail.com> wrote: >> > >>>>>>>> >> >> > >>>>>>>> >>> >> > >>>>>>>> >>> >> > >>>>>>>> >>> On Sat, Dec 17, 2016 at 3:25 PM, Wes Turner < >> > >>>>>>>> wes.turner at gmail.com> >> > >>>>>>>> >>> wrote: >> > >>>>>>>> >>> >> > >>>>>>>> >>>> >> > >>>>>>>> >>>> >> > >>>>>>>> >>>> On Sat, Dec 17, 2016 at 2:39 PM, Bob Haffner via Omaha < >> > >>>>>>>> >>>> omaha at python.org> wrote: >> > >>>>>>>> >>>> >> > >>>>>>>> >>>>> >Does Kaggle take the high mark but still give a score >> for >> > >>>>>>>> each >> > >>>>>>>> >>>>> submission? >> > >>>>>>>> >>>>> Yes. >> > >>>>>>>> >>>>> https://www.kaggle.com/c/house-prices-advanced- >> > regression-te >> > >>>>>>>> >>>>> chniques/submissions >> > >>>>>>>> >>>>> >> > >>>>>>>> >>>>> >> > >>>>>>>> >>>>> >Thinking of ways to keep track of which code produced >> which >> > >>>>>>>> score; >> > >>>>>>>> >>>>> I'll >> > >>>>>>>> >>>>> >post about the GitHub setup in a bit. >> > >>>>>>>> >>>>> We could push our notebooks to the github repo? Maybe >> > >>>>>>>> include a brief >> > >>>>>>>> >>>>> description at the top in a markdown cell >> > >>>>>>>> >>>>> >> > >>>>>>>> >>>> >> > >>>>>>>> >>>> In my research [1], I found that the preferred folder >> > >>>>>>>> structure for >> > >>>>>>>> >>>> kaggle is input/ (data), src/ (.py, .ipnb notebooks), and >> > >>>>>>>> working/ >> > >>>>>>>> >>>> (outputs); >> > >>>>>>>> >>>> and that they recommend creating a settings.json with path >> > >>>>>>>> >>>> configuration (e.g. pointing to input/, src/ data/) >> > >>>>>>>> >>>> >> > >>>>>>>> >>>> So, we could put notebooks, folders, and repos in src/ >> [2]. >> > >>>>>>>> >>>> >> > >>>>>>>> >>>> runipy is a bit more scriptable than requiring notebook >> gui >> > >>>>>>>> >>>> interactions [3]. >> > >>>>>>>> >>>> >> > >>>>>>>> >>>> We could either hardcode '../input/test.csv' in our .py >> and >> > >>>>>>>> .ipnb >> > >>>>>>>> >>>> sources, or we could write a function in src/data.py to >> read >> > >>>>>>>> >>>> '../settings.json' into a dict with the recommended >> variable >> > >>>>>>>> names [1]: >> > >>>>>>>> >>>> >> > >>>>>>>> >>>> from data import read_settings_json >> > >>>>>>>> >>>> settings = read_settings_json() >> > >>>>>>>> >>>> train = pd.read_csv(settings['TRAIN_DATA_PATH']) >> > >>>>>>>> >>>> # .... 
>> > >>>>>>>> >>>> pd.write_csv(settings['SUBMISSION_PATH']) >> > >>>>>>>> >>>> >> > >>>>>>>> >>>> [1] https://github.com/omahapython >> > >>>>>>>> /datascience/issues/3#issuecom >> > >>>>>>>> >>>> ment-267236556 >> > >>>>>>>> >>>> [2] https://github.com/omahapython >> > >>>>>>>> /kaggle-houseprices/tree/master/src >> > >>>>>>>> >>>> [3] https://pypi.python.org/pypi/runipy >> > >>>>>>>> >>>> >> > >>>>>>>> >>>> >> > >>>>>>>> >>>>> >> > >>>>>>>> >>>>> I initially thought github was a good way to go, but I >> don't >> > >>>>>>>> know if >> > >>>>>>>> >>>>> everyone has a github acct or is interested in starting >> one. >> > >>>>>>>> Maybe >> > >>>>>>>> >>>>> email >> > >>>>>>>> >>>>> is the way to go? >> > >>>>>>>> >>>>> >> > >>>>>>>> >>>> >> > >>>>>>>> >>>> I'm all for GitHub: >> > >>>>>>>> >>>> >> > >>>>>>>> >>>> - git source control and revision numbers >> > >>>>>>>> >>>> - we're not able to easily share code in the mailing list >> > >>>>>>>> >>>> - we can learn from each others' solutions >> > >>>>>>>> >>>> >> > >>>>>>>> >>> >> > >>>>>>>> >>> An example of mailing list limitations: >> > >>>>>>>> >>> >> > >>>>>>>> >>> >> > >>>>>>>> >>> Your mail to 'Omaha' with the subject >> > >>>>>>>> >>> >> > >>>>>>>> >>> Re: [omaha] Group Data Science Competition >> > >>>>>>>> >>> >> > >>>>>>>> >>> Is being held until the list moderator can review it for >> > >>>>>>>> approval. >> > >>>>>>>> >>> >> > >>>>>>>> >>> The reason it is being held: >> > >>>>>>>> >>> >> > >>>>>>>> >>> Message body is too big: 47004 bytes with a limit of >> 40 KB >> > >>>>>>>> >>> >> > >>>>>>>> >>> (I trimmed out the reply chain; so this may make it >> through >> > >>>>>>>> first) >> > >>>>>>>> >>> >> > >>>>>>>> >> >> > >>>>>>>> >> >> > >>>>>>>> > >> > >>>>>>>> _______________________________________________ >> > >>>>>>>> Omaha Python Users Group mailing list >> > >>>>>>>> Omaha at python.org >> > >>>>>>>> https://mail.python.org/mailman/listinfo/omaha >> > >>>>>>>> http://www.OmahaPython.org >> > >>>>>>>> >> > >>>>>>> >> > >>>>>>> >> > >>>>>> >> > >>>>> >> > >>>> >> > >>> >> > >> >> > > >> > _______________________________________________ >> > Omaha Python Users Group mailing list >> > Omaha at python.org >> > https://mail.python.org/mailman/listinfo/omaha >> > http://www.OmahaPython.org >> > >> _______________________________________________ >> Omaha Python Users Group mailing list >> Omaha at python.org >> https://mail.python.org/mailman/listinfo/omaha >> http://www.OmahaPython.org > > From wes.turner at gmail.com Wed Dec 21 14:41:45 2016 From: wes.turner at gmail.com (Wes Turner) Date: Wed, 21 Dec 2016 13:41:45 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer < luke.schollmeyer at gmail.com> wrote: > The quick explanation is rather than dropping outliers, I used numpy's > log1p function to help normalize distribution of the data (for both the > sale price and for all features over a certain skewness). I was also > struggling with adding in more features to the model. > https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html - http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations https://en.wikipedia.org/wiki/Log-normal_distribution How did you determine the skewness threshold? ... 
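A self-contained version of the concat/slice pattern described above, plus
an equivalent reindex-based alternative that avoids combining the frames.
File paths follow the competition layout, but treat the code as a sketch
rather than the notebook's exact contents:

    import pandas as pd

    train = pd.read_csv('../input/train.csv', index_col='Id')
    test = pd.read_csv('../input/test.csv', index_col='Id')

    # Encode train and test together so both get identical dummy columns,
    # then slice the combined frame back apart by row count.
    combined = pd.concat((train.drop('SalePrice', axis=1), test))
    combined = pd.get_dummies(combined)
    trainX = combined.iloc[:train.shape[0]]
    testX = combined.iloc[train.shape[0]:]
    y = train['SalePrice']

    # Alternative without concatenation: encode separately, then force the
    # test columns to match the train columns (missing dummies become 0).
    trainX2 = pd.get_dummies(train.drop('SalePrice', axis=1))
    testX2 = pd.get_dummies(test).reindex(columns=trainX2.columns, fill_value=0)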
From wes.turner at gmail.com  Wed Dec 21 14:41:45 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Wed, 21 Dec 2016 13:41:45 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer <
luke.schollmeyer at gmail.com> wrote:

> The quick explanation is rather than dropping outliers, I used numpy's
> log1p function to help normalize distribution of the data (for both the
> sale price and for all features over a certain skewness). I was also
> struggling with adding in more features to the model.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html
- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html

https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations

https://en.wikipedia.org/wiki/Log-normal_distribution

How did you determine the skewness threshold?

...

https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Specified_variance:_the_normal_distribution

https://en.wikipedia.org/wiki/Normalization_(statistics)

http://scikit-learn.org/stable/modules/preprocessing.html#normalization

> The training and test data sets have different "completeness" of some
> features, and using pd.get_dummies can be problematic when you fit a model
> versus predicting if you don't have the same columns/features. I simply
> combined the train and test data sets (without the Id and SalePrice) and
> ran the get_dummies function over that set.

autoclean_cv loads the train set first and then applies those
categorical/numerical mappings to the test set:
https://github.com/rhiever/datacleaner#datacleaner-in-scripts

When I modify load_house_prices [1] to also load test.csv in order to
autoclean_cv, I might try assigning the categorical levels according to
the ranking in data_description.txt, rather than the happenstance ordering
in train.csv; though get_dummies should make that irrelevant.

[1] https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L45

I should probably also manually specify that 'Id' is the index column in
pd.read_csv (assuming there are no duplicates, which pandas should check
for).

> When I needed to fit the model, I just "unraveled" the combined set with
> the train and test parts.
>
>     combined = pd.concat((train.loc[:, 'MSSubClass':'SaleCondition'],
>                           test.loc[:, 'MSSubClass':'SaleCondition']))
>
>     combined = pd.get_dummies(combined)
>
>     ::: do some feature engineering :::
>
>     trainX = combined[:train.shape[0]]
>     y = train['SalePrice']
>
> Just so long as you don't do anything to the combined dataframe (like
> sorting), you can slice off each part based on its shape.

http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy

> And when you would be pulling the data to predict the test data, you get
> the other part:
>
>     testX = combined[train.shape[0]:]

Why is the concatenation necessary?
- log1p doesn't need the whole column
- get_dummies doesn't need the whole column
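Two of the points above, sketched: passing index_col='Id' keeps the row id
out of the feature matrix, and pd.Categorical can impose the ranking from
data_description.txt on an ordinal column. ExterQual is used as the example
here; its level list is transcribed from the data description and is worth
verifying against the file:

    import pandas as pd

    # 'Id' becomes the index rather than a (meaningless) numeric feature.
    train = pd.read_csv('../input/train.csv', index_col='Id')

    # Rank the quality levels per data_description.txt (worst to best),
    # instead of the happenstance ordering in train.csv.
    quality_levels = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
    train['ExterQual'] = pd.Categorical(train['ExterQual'],
                                        categories=quality_levels,
                                        ordered=True)

    # .cat.codes then gives a rank-respecting integer encoding (0..4).
    exterqual_codes = train['ExterQual'].cat.codes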
From bob.haffner at gmail.com  Thu Dec 22 21:47:26 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Thu, 22 Dec 2016 20:47:26 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Made a TPOT attempt tonight. Could only do some numeric features though
because including categoricals would cause my ipython kernel to die.
I will try a bigger box this weekend On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha wrote: > On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner wrote: > > > > On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner > wrote: > > > >> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer < > >> luke.schollmeyer at gmail.com> wrote: > >> > >>> The quick explanation is rather than dropping outliers, I used numpy's > >>> log1p function to help normalize the distribution of the data (for both the > >>> sale price and for all features over a certain skewness). I was also > >>> struggling with adding in more features to the model. > >> > >> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html > >> - http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html > >> > >> https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations > >> https://en.wikipedia.org/wiki/Log-normal_distribution > >> > >> How did you determine the skewness threshold? > >> > >> ... > >> > >> https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Specified_variance:_the_normal_distribution > >> https://en.wikipedia.org/wiki/Normalization_(statistics) > >> http://scikit-learn.org/stable/modules/preprocessing.html#normalization > > > > - https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network > > - https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks > > > > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn > > > >>> The training and test data sets have different "completeness" of some > >>> features, and using pd.get_dummies can be problematic when you fit a model > >>> versus predicting if you don't have the same columns/features. I simply > >>> combined the train and test data sets (without the Id and SalePrice) and > >>> ran the get_dummies function over that set. > >> > >> autoclean_cv loads the train set first and then applies those > >> categorical/numerical mappings to the test set > >> https://github.com/rhiever/datacleaner#datacleaner-in-scripts > >> > >> When I modify load_house_prices [1] to also load test.csv in order to > >> autoclean_csv, I might try assigning the categorical levels according to the > >> ranking in data_description.txt, rather than the happenstance ordering in > >> train.csv; though get_dummies should make that irrelevant. > >> > >> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L45 > >> > >> I should probably also manually specify that 'Id' is the index column in > >> pd.read_csv (assuming there are no duplicates, which pandas should check for). > >> > >>> When I needed to fit the model, I just "unraveled" the combined set with > >>> the train and test parts. > >>> > >>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'], > >>> test.loc[:,'MSSubClass':'SaleCondition'])) > >>> > >>> combined = pd.get_dummies(combined) > >>> > >>> ::: do some feature engineering ::: > >>> > >>> trainX = combined[:train.shape[0]] > >>> y = train['SalePrice'] > >>> > >>> Just so long as you don't do anything to the combined dataframe (like > >>> sorting), you can slice off each part based on its shape.
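> >> (A possible alternative that skips the concat -- an untested sketch,
> >> with train/test as the raw frames:
> >>
> >> train_d = pd.get_dummies(train.drop(['Id', 'SalePrice'], axis=1))
> >> test_d = pd.get_dummies(test.drop(['Id'], axis=1))
> >> train_d, test_d = train_d.align(test_d, join='left', axis=1, fill_value=0)
> >>
> >> DataFrame.align with join='left' keeps the training columns, and
> >> fill_value=0 zero-fills any dummy level that never occurs in test, so
> >> both frames end up with identical columns.)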
> >>> > >> > >> http://pandas.pydata.org/pandas-docs/stable/indexing.html# > >> returning-a-view-versus-a-copy > >> > >> > >>> > >>> and when you would be pulling the data to predict the test data, you > get > >>> the other part: > >>> > >>> testX = combined[train.shape[0]:] > >>> > >> > >> Why is the concatenation necessary? > >> - log1p doesn't need the whole column > >> - get_dummies doesn't need the whole column > >> > > > http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing. > StandardScaler.html > requires the whole column. > > ( > http://scikit-learn.org/stable/modules/preprocessing. > html#preprocessing-scaler > ) > > > > > > > >> > >>> > >>> > >>> Luke > >>> > >>> > >>> > >> > > (Trimmed reply-chain (again) because 40Kb limit) > > > > > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > From luke.schollmeyer at gmail.com Fri Dec 23 09:03:00 2016 From: luke.schollmeyer at gmail.com (Luke Schollmeyer) Date: Fri, 23 Dec 2016 08:03:00 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: Moved the needle a little bit yesterday with a ridge regression attempt using the same feature engineering I described before. Luke On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner wrote: > Made a TPOT attempt tonight. Could only do some numeric features though > because including categoricals would cause my ipython kernel to die. > > I will try a bigger box this weekend > > On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha > wrote: > >> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner wrote: >> >> > >> > >> > On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner >> wrote: >> > >> >> >> >> >> >> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer < >> >> luke.schollmeyer at gmail.com> wrote: >> >> >> >>> The quick explanation is rather than dropping outliers, I used numpy's >> >>> log1p function to help normalize distribution of the data (for both >> the >> >>> sale price and for all features over a certain skewness). I was also >> >>> struggling with adding in more features to the model. >> >>> >> >> >> >> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html >> >> - http://scikit-learn.org/stable/modules/generated/sklearn. >> >> preprocessing.FunctionTransformer.html >> >> >> >> >> >> https://en.wikipedia.org/wiki/Data_transformation_(statistic >> >> s)#Common_transformations >> >> >> >> https://en.wikipedia.org/wiki/Log-normal_distribution >> >> >> >> >> >> How did you determine the skewness threshold? >> >> >> >> ... 
>> >> >> >> https://en.wikipedia.org/wiki/Maximum_entropy_probability_di >> >> stribution#Specified_variance:_the_normal_distribution >> >> >> >> https://en.wikipedia.org/wiki/Normalization_(statistics) >> >> >> >> http://scikit-learn.org/stable/modules/preprocessing.html# >> normalization >> >> >> > >> > - https://stackoverflow.com/questions/4674623/why-do-we- >> > have-to-normalize-the-input-for-an-artificial-neural-network >> > - https://stats.stackexchange.com/questions/7757/data-normaliz >> ation-and- >> > standardization-in-neural-networks >> > >> >> https://github.com/tensorflow/tensorflow/tree/master/tensorf >> low/contrib/learn/python/learn >> >> >> > >> > >> >> >> >> >> >> >> >> >> >>> The training and test data sets have different "completeness" of some >> >>> features, and using pd.get_dummies can be problematic when you fit a >> model >> >>> versus predicting if you don't have the same columns/features. I >> simply >> >>> combined the train and test data sets (without the Id and SalePrice) >> and >> >>> ran the get_dummies function over that set. >> >>> >> >> >> >> autoclean_cv loads the train set first and then applies those >> >> categorical/numerical mappings to the test set >> >> https://github.com/rhiever/datacleaner#datacleaner-in-scripts >> >> >> >> When I modify load_house_prices [1] to also load test.csv in order to >> >> autoclean_csv, >> >> I might try assigning the categorical levels according to the ranking >> in >> >> data_description.txt, >> >> rather than the happenstance ordering in train.csv; >> >> though get_dummies should make that irrelevant. >> >> >> >> https://github.com/westurner/house_prices/blob/2839ff8a/hous >> >> e_prices/data.py#L45 >> >> >> >> I should probably also manually specify that 'Id' is the index column >> in >> >> pd.read_csv (assuming there are no duplicates, which pandas should >> check >> >> for). >> >> >> >> >> >>> When I needed to fit the model, I just "unraveled" the combined set >> with >> >>> the train and test parts. >> >>> >> >>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'], >> >>> test.loc[:,'MSSubClass':'SaleCondition'])) >> >>> >> >>> combined = pd.get_dummies(combined) >> >>> >> >>> ::: do some feature engineering ::: >> >>> >> >>> trainX = combined[:train.shape[0]] >> >>> y = train['SalePrice'] >> >>> >> >>> Just so long you don't do anything to the combined dataframe (like >> >>> sorting), you can slice off each part based on it's shape. >> >>> >> >> >> >> http://pandas.pydata.org/pandas-docs/stable/indexing.html# >> >> returning-a-view-versus-a-copy >> >> >> >> >> >>> >> >>> and when you would be pulling the data to predict the test data, you >> get >> >>> the other part: >> >>> >> >>> testX = combined[train.shape[0]:] >> >>> >> >> >> >> Why is the concatenation necessary? >> >> - log1p doesn't need the whole column >> >> - get_dummies doesn't need the whole column >> >> >> > >> http://scikit-learn.org/stable/modules/generated/sklearn. >> preprocessing.StandardScaler.html >> requires the whole column. 
>> >> ( >> http://scikit-learn.org/stable/modules/preprocessing.html# >> preprocessing-scaler >> ) >> >> >> >> >> > >> >> >> >>> >> >>> >> >>> Luke >> >>> >> >>> >> >>> >> >> >> > (Trimmed reply-chain (again) because 40Kb limit) >> > >> > >> _______________________________________________ >> Omaha Python Users Group mailing list >> Omaha at python.org >> https://mail.python.org/mailman/listinfo/omaha >> http://www.OmahaPython.org >> > > From choman at gmail.com Sat Dec 24 17:00:19 2016 From: choman at gmail.com (Chad Homan) Date: Sat, 24 Dec 2016 16:00:19 -0600 Subject: [omaha] Happy Holidays Message-ID: Merry Xmas and Happy New Year Stay Safe -- Chad Some people, when confronted with a problem, think "I know, I'll use Windows." Now they have two problems. Some people claim if you play a Windows Install Disc backwards you'll hear satanic Messages. That's nothing, if you play it forward it installs Windows From bob.haffner at gmail.com Sun Dec 25 20:05:26 2016 From: bob.haffner at gmail.com (Bob Haffner) Date: Sun, 25 Dec 2016 19:05:26 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: Merry Christmas, everyone! Still heading down the TPOT path with limited success. I get varying scores (tpot.score()) with the same result (kaggle scoring) Any other TPOT users getting inconsistent results? Specifically with 0.6.7? On Fri, Dec 23, 2016 at 8:03 AM, Luke Schollmeyer < luke.schollmeyer at gmail.com> wrote: > Moved the needle a little bit yesterday with a ridge regression attempt > using the same feature engineering I described before. > > Luke > > On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner > wrote: > >> Made a TPOT attempt tonight. Could only do some numeric features though >> because including categoricals would cause my ipython kernel to die. >> >> I will try a bigger box this weekend >> >> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha >> wrote: >> >>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner >>> wrote: >>> >>> > >>> > >>> > On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner >>> wrote: >>> > >>> >> >>> >> >>> >> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer < >>> >> luke.schollmeyer at gmail.com> wrote: >>> >> >>> >>> The quick explanation is rather than dropping outliers, I used >>> numpy's >>> >>> log1p function to help normalize distribution of the data (for both >>> the >>> >>> sale price and for all features over a certain skewness). I was also >>> >>> struggling with adding in more features to the model. >>> >>> >>> >> >>> >> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html >>> >> - http://scikit-learn.org/stable/modules/generated/sklearn. >>> >> preprocessing.FunctionTransformer.html >>> >> >>> >> >>> >> https://en.wikipedia.org/wiki/Data_transformation_(statistic >>> >> s)#Common_transformations >>> >> >>> >> https://en.wikipedia.org/wiki/Log-normal_distribution >>> >> >>> >> >>> >> How did you determine the skewness threshold? >>> >> >>> >> ... 
>>> >> >>> >> https://en.wikipedia.org/wiki/Maximum_entropy_probability_di >>> >> stribution#Specified_variance:_the_normal_distribution >>> >> >>> >> https://en.wikipedia.org/wiki/Normalization_(statistics) >>> >> >>> >> http://scikit-learn.org/stable/modules/preprocessing.html#no >>> rmalization >>> >> >>> > >>> > - https://stackoverflow.com/questions/4674623/why-do-we- >>> > have-to-normalize-the-input-for-an-artificial-neural-network >>> > - https://stats.stackexchange.com/questions/7757/data-normaliz >>> ation-and- >>> > standardization-in-neural-networks >>> > >>> >>> https://github.com/tensorflow/tensorflow/tree/master/tensorf >>> low/contrib/learn/python/learn >>> >>> >>> > >>> > >>> >> >>> >> >>> >> >>> >> >>> >>> The training and test data sets have different "completeness" of some >>> >>> features, and using pd.get_dummies can be problematic when you fit a >>> model >>> >>> versus predicting if you don't have the same columns/features. I >>> simply >>> >>> combined the train and test data sets (without the Id and SalePrice) >>> and >>> >>> ran the get_dummies function over that set. >>> >>> >>> >> >>> >> autoclean_cv loads the train set first and then applies those >>> >> categorical/numerical mappings to the test set >>> >> https://github.com/rhiever/datacleaner#datacleaner-in-scripts >>> >> >>> >> When I modify load_house_prices [1] to also load test.csv in order to >>> >> autoclean_csv, >>> >> I might try assigning the categorical levels according to the ranking >>> in >>> >> data_description.txt, >>> >> rather than the happenstance ordering in train.csv; >>> >> though get_dummies should make that irrelevant. >>> >> >>> >> https://github.com/westurner/house_prices/blob/2839ff8a/hous >>> >> e_prices/data.py#L45 >>> >> >>> >> I should probably also manually specify that 'Id' is the index column >>> in >>> >> pd.read_csv (assuming there are no duplicates, which pandas should >>> check >>> >> for). >>> >> >>> >> >>> >>> When I needed to fit the model, I just "unraveled" the combined set >>> with >>> >>> the train and test parts. >>> >>> >>> >>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'], >>> >>> test.loc[:,'MSSubClass':'SaleCondition'])) >>> >>> >>> >>> combined = pd.get_dummies(combined) >>> >>> >>> >>> ::: do some feature engineering ::: >>> >>> >>> >>> trainX = combined[:train.shape[0]] >>> >>> y = train['SalePrice'] >>> >>> >>> >>> Just so long you don't do anything to the combined dataframe (like >>> >>> sorting), you can slice off each part based on it's shape. >>> >>> >>> >> >>> >> http://pandas.pydata.org/pandas-docs/stable/indexing.html# >>> >> returning-a-view-versus-a-copy >>> >> >>> >> >>> >>> >>> >>> and when you would be pulling the data to predict the test data, you >>> get >>> >>> the other part: >>> >>> >>> >>> testX = combined[train.shape[0]:] >>> >>> >>> >> >>> >> Why is the concatenation necessary? >>> >> - log1p doesn't need the whole column >>> >> - get_dummies doesn't need the whole column >>> >> >>> > >>> http://scikit-learn.org/stable/modules/generated/sklearn.pre >>> processing.StandardScaler.html >>> requires the whole column. 
>>> >>> ( >>> http://scikit-learn.org/stable/modules/preprocessing.html#pr >>> eprocessing-scaler >>> ) >>> >>> >>> >>> >>> > >>> >> >>> >>> >>> >>> >>> >>> Luke >>> >>> >>> >>> >>> >>> >>> >> >>> > (Trimmed reply-chain (again) because 40Kb limit) >>> > >>> > >>> _______________________________________________ >>> Omaha Python Users Group mailing list >>> Omaha at python.org >>> https://mail.python.org/mailman/listinfo/omaha >>> http://www.OmahaPython.org >>> >> >> > From wes.turner at gmail.com Sun Dec 25 20:40:54 2016 From: wes.turner at gmail.com (Wes Turner) Date: Sun, 25 Dec 2016 19:40:54 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: On Sunday, December 25, 2016, Bob Haffner wrote: > Merry Christmas, everyone! > > Merry Christmas! > > Still heading down the TPOT path with limited success. I get varying > scores (tpot.score()) with the same result (kaggle scoring) > > Any other TPOT users getting inconsistent results? Specifically with > 0.6.7? > There may be variance because of the way TPOT splits X_train into X_train and X_test w/ train_size and test_size. I rewrote load_house_prices as a class w/ a better mccabe cyclomatic complexity score with a concatenation step so that X_train and X_test have the same columns (in data.py) It probably makes sense to use scikit-learn for data transformation (e.g. OneHotEncoder instead of get_dummies). https://twitter.com/westurner/status/813011289475842048 : """ . at scikit_learn Src: https://t.co/biMt6XRt2T Docs: https://t.co/Lb5EYRCdI8 #API: .fit_transform(X, y) .fit(X_train, y_train) .predict(X_test) """ I haven't yet run w/ pd.get_dummies and df.to_sparse; that may or may not prevent the oom error. Looking at https://libraries.io/pypi/xgboost "Dependent Repositories", there are a number of scikit-learn-compatible packages for automating analysis in addition to TPOT: auto-sklearn, rep. auto_ml mentions 12 algos for type_of_estimator='regressor'. (and sparse matrices, and other parameters). https://github.com/ClimbsRocks/auto_ml http://auto-ml.readthedocs.io/en/latest/ I should be able to generate column_descriptions from parse_description in data.py: https://github.com/westurner/house_prices/blob/develop/house_prices/data.py https://github.com/automl/auto-sklearn looks cool too. ... http://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw http://tflearn.org > > > On Fri, Dec 23, 2016 at 8:03 AM, Luke Schollmeyer < > luke.schollmeyer at gmail.com > > wrote: > >> Moved the needle a little bit yesterday with a ridge regression attempt >> using the same feature engineering I described before. >> >> Luke >> >> On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner > > wrote: >> >>> Made a TPOT attempt tonight. Could only do some numeric features though >>> because including categoricals would cause my ipython kernel to die. 
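(An untested sketch of the OneHotEncoder-instead-of-get_dummies idea; OneHotEncoder wants integer category codes, so pd.factorize or datacleaner's mappings would have to run first -- X_train_int / X_test_int below are hypothetical names for those integer-coded arrays:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=True, handle_unknown='ignore')
X_train_cat = enc.fit_transform(X_train_int)  # fit on the train set only
X_test_cat = enc.transform(X_test_int)  # same columns back, scipy.sparse output

Fitting on train and only transforming test keeps the columns identical without concatenating the two sets, and the sparse output might also sidestep the OOM error.)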
>>> >>> I will try a bigger box this weekend >>> >>> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha >> > wrote: >>> >>>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner >>> > wrote: >>>> >>>> > >>>> > >>>> > On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner >>> > wrote: >>>> > >>>> >> >>>> >> >>>> >> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer < >>>> >> luke.schollmeyer at gmail.com >>>> > wrote: >>>> >> >>>> >>> The quick explanation is rather than dropping outliers, I used >>>> numpy's >>>> >>> log1p function to help normalize distribution of the data (for both >>>> the >>>> >>> sale price and for all features over a certain skewness). I was also >>>> >>> struggling with adding in more features to the model. >>>> >>> >>>> >> >>>> >> https://docs.scipy.org/doc/numpy/reference/generated/numpy.l >>>> og1p.html >>>> >> - http://scikit-learn.org/stable/modules/generated/sklearn. >>>> >> preprocessing.FunctionTransformer.html >>>> >> >>>> >> >>>> >> https://en.wikipedia.org/wiki/Data_transformation_(statistic >>>> >> s)#Common_transformations >>>> >> >>>> >> https://en.wikipedia.org/wiki/Log-normal_distribution >>>> >> >>>> >> >>>> >> How did you determine the skewness threshold? >>>> >> >>>> >> ... >>>> >> >>>> >> https://en.wikipedia.org/wiki/Maximum_entropy_probability_di >>>> >> stribution#Specified_variance:_the_normal_distribution >>>> >> >>>> >> https://en.wikipedia.org/wiki/Normalization_(statistics) >>>> >> >>>> >> http://scikit-learn.org/stable/modules/preprocessing.html#no >>>> rmalization >>>> >> >>>> > >>>> > - https://stackoverflow.com/questions/4674623/why-do-we- >>>> > have-to-normalize-the-input-for-an-artificial-neural-network >>>> > - https://stats.stackexchange.com/questions/7757/data-normaliz >>>> ation-and- >>>> > standardization-in-neural-networks >>>> > >>>> >>>> https://github.com/tensorflow/tensorflow/tree/master/tensorf >>>> low/contrib/learn/python/learn >>>> >>>> >>>> > >>>> > >>>> >> >>>> >> >>>> >> >>>> >> >>>> >>> The training and test data sets have different "completeness" of >>>> some >>>> >>> features, and using pd.get_dummies can be problematic when you fit >>>> a model >>>> >>> versus predicting if you don't have the same columns/features. I >>>> simply >>>> >>> combined the train and test data sets (without the Id and >>>> SalePrice) and >>>> >>> ran the get_dummies function over that set. >>>> >>> >>>> >> >>>> >> autoclean_cv loads the train set first and then applies those >>>> >> categorical/numerical mappings to the test set >>>> >> https://github.com/rhiever/datacleaner#datacleaner-in-scripts >>>> >> >>>> >> When I modify load_house_prices [1] to also load test.csv in order to >>>> >> autoclean_csv, >>>> >> I might try assigning the categorical levels according to the >>>> ranking in >>>> >> data_description.txt, >>>> >> rather than the happenstance ordering in train.csv; >>>> >> though get_dummies should make that irrelevant. >>>> >> >>>> >> https://github.com/westurner/house_prices/blob/2839ff8a/hous >>>> >> e_prices/data.py#L45 >>>> >> >>>> >> I should probably also manually specify that 'Id' is the index >>>> column in >>>> >> pd.read_csv (assuming there are no duplicates, which pandas should >>>> check >>>> >> for). >>>> >> >>>> >> >>>> >>> When I needed to fit the model, I just "unraveled" the combined set >>>> with >>>> >>> the train and test parts. 
>>>> >>> >>>> >>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'], >>>> >>> test.loc[:,'MSSubClass':'SaleCondition'])) >>>> >>> >>>> >>> combined = pd.get_dummies(combined) >>>> >>> >>>> >>> ::: do some feature engineering ::: >>>> >>> >>>> >>> trainX = combined[:train.shape[0]] >>>> >>> y = train['SalePrice'] >>>> >>> >>>> >>> Just so long you don't do anything to the combined dataframe (like >>>> >>> sorting), you can slice off each part based on it's shape. >>>> >>> >>>> >> >>>> >> http://pandas.pydata.org/pandas-docs/stable/indexing.html# >>>> >> returning-a-view-versus-a-copy >>>> >> >>>> >> >>>> >>> >>>> >>> and when you would be pulling the data to predict the test data, >>>> you get >>>> >>> the other part: >>>> >>> >>>> >>> testX = combined[train.shape[0]:] >>>> >>> >>>> >> >>>> >> Why is the concatenation necessary? >>>> >> - log1p doesn't need the whole column >>>> >> - get_dummies doesn't need the whole column >>>> >> >>>> > >>>> http://scikit-learn.org/stable/modules/generated/sklearn.pre >>>> processing.StandardScaler.html >>>> requires the whole column. >>>> >>>> ( >>>> http://scikit-learn.org/stable/modules/preprocessing.html#pr >>>> eprocessing-scaler >>>> ) >>>> >>>> >>>> >>>> >>>> > >>>> >> >>>> >>> >>>> >>> >>>> >>> Luke >>>> >>> >>>> >>> >>>> >>> >>>> >> >>>> > (Trimmed reply-chain (again) because 40Kb limit) >>>> > >>>> > >>>> _______________________________________________ >>>> Omaha Python Users Group mailing list >>>> Omaha at python.org >>>> https://mail.python.org/mailman/listinfo/omaha >>>> http://www.OmahaPython.org >>>> >>> >>> >> > From wes.turner at gmail.com Sun Dec 25 22:41:56 2016 From: wes.turner at gmail.com (Wes Turner) Date: Sun, 25 Dec 2016 21:41:56 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: On Sun, Dec 25, 2016 at 7:40 PM, Wes Turner wrote: > > > On Sunday, December 25, 2016, Bob Haffner wrote: > >> Merry Christmas, everyone! >> >> > Merry Christmas! > >> >> Still heading down the TPOT path with limited success. I get varying >> scores (tpot.score()) with the same result (kaggle scoring) >> >> Any other TPOT users getting inconsistent results? Specifically with >> 0.6.7? >> > > There may be variance because of the way TPOT splits X_train into X_train > and X_test w/ train_size and test_size. > > I rewrote load_house_prices as a class w/ a better mccabe cyclomatic > complexity score with a concatenation step so that X_train and X_test have > the same columns (in data.py) > > It probably makes sense to use scikit-learn for data transformation (e.g. > OneHotEncoder instead of get_dummies). > > https://twitter.com/westurner/status/813011289475842048 : > """ > . at scikit_learn > Src: https://t.co/biMt6XRt2T > Docs: https://t.co/Lb5EYRCdI8 > #API: > .fit_transform(X, y) > .fit(X_train, y_train) > .predict(X_test) > """ > > I haven't yet run w/ pd.get_dummies and df.to_sparse; that may or may not > prevent the oom error. > > Looking at https://libraries.io/pypi/xgboost "Dependent Repositories", > there are a number of scikit-learn-compatible packages for automating > analysis in addition to TPOT: auto-sklearn, rep. > auto_ml mentions 12 algos for type_of_estimator='regressor'. > (and sparse matrices, and other parameters). 
> > https://github.com/ClimbsRocks/auto_ml > > http://auto-ml.readthedocs.io/en/latest/ > Here's a (probably overfitted) auto_ml attempt: https://github.com/westurner/house_prices/blob/7260ada0c10cf371b33973b0d9af6bca860d0008/house_prices/analysis_auto_ml.py https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3958857 ..."Your submission scored 9.45422, which is not an improvement of your best score. " Setting .train(compute_power=10) errored out after a bunch of GridSearchCV. > > > I should be able to generate column_descriptions from parse_description in > data.py: > https://github.com/westurner/house_prices/blob/develop/ > house_prices/data.py > > https://github.com/automl/auto-sklearn looks cool too. > > ... http://stats.stackexchange.com/questions/181/how-to-choose-the-number- > of-hidden-layers-and-nodes-in-a-feedforward-neural-netw > > http://tflearn.org > > >> >> >> On Fri, Dec 23, 2016 at 8:03 AM, Luke Schollmeyer < >> luke.schollmeyer at gmail.com> wrote: >> >>> Moved the needle a little bit yesterday with a ridge regression attempt >>> using the same feature engineering I described before. >>> >>> Luke >>> >>> On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner >>> wrote: >>> >>>> Made a TPOT attempt tonight. Could only do some numeric features >>>> though because including categoricals would cause my ipython kernel to die. >>>> >>>> I will try a bigger box this weekend >>>> >>>> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha >>> > wrote: >>>> >>>>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner >>>>> wrote: >>>>> >>>>> > >>>>> > >>>>> > On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner >>>>> wrote: >>>>> > >>>>> >> >>>>> >> >>>>> >> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer < >>>>> >> luke.schollmeyer at gmail.com> wrote: >>>>> >> >>>>> >>> The quick explanation is rather than dropping outliers, I used >>>>> numpy's >>>>> >>> log1p function to help normalize distribution of the data (for >>>>> both the >>>>> >>> sale price and for all features over a certain skewness). I was >>>>> also >>>>> >>> struggling with adding in more features to the model. >>>>> >>> >>>>> >> >>>>> >> https://docs.scipy.org/doc/numpy/reference/generated/numpy.l >>>>> og1p.html >>>>> >> - http://scikit-learn.org/stable/modules/generated/sklearn. >>>>> >> preprocessing.FunctionTransformer.html >>>>> >> >>>>> >> >>>>> >> https://en.wikipedia.org/wiki/Data_transformation_(statistic >>>>> >> s)#Common_transformations >>>>> >> >>>>> >> https://en.wikipedia.org/wiki/Log-normal_distribution >>>>> >> >>>>> >> >>>>> >> How did you determine the skewness threshold? >>>>> >> >>>>> >> ... 
>>>>> >> >>>>> >> https://en.wikipedia.org/wiki/Maximum_entropy_probability_di >>>>> >> stribution#Specified_variance:_the_normal_distribution >>>>> >> >>>>> >> https://en.wikipedia.org/wiki/Normalization_(statistics) >>>>> >> >>>>> >> http://scikit-learn.org/stable/modules/preprocessing.html#no >>>>> rmalization >>>>> >> >>>>> > >>>>> > - https://stackoverflow.com/questions/4674623/why-do-we- >>>>> > have-to-normalize-the-input-for-an-artificial-neural-network >>>>> > - https://stats.stackexchange.com/questions/7757/data-normaliz >>>>> ation-and- >>>>> > standardization-in-neural-networks >>>>> > >>>>> >>>>> https://github.com/tensorflow/tensorflow/tree/master/tensorf >>>>> low/contrib/learn/python/learn >>>>> >>>>> >>>>> > >>>>> > >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> >>> The training and test data sets have different "completeness" of >>>>> some >>>>> >>> features, and using pd.get_dummies can be problematic when you fit >>>>> a model >>>>> >>> versus predicting if you don't have the same columns/features. I >>>>> simply >>>>> >>> combined the train and test data sets (without the Id and >>>>> SalePrice) and >>>>> >>> ran the get_dummies function over that set. >>>>> >>> >>>>> >> >>>>> >> autoclean_cv loads the train set first and then applies those >>>>> >> categorical/numerical mappings to the test set >>>>> >> https://github.com/rhiever/datacleaner#datacleaner-in-scripts >>>>> >> >>>>> >> When I modify load_house_prices [1] to also load test.csv in order >>>>> to >>>>> >> autoclean_csv, >>>>> >> I might try assigning the categorical levels according to the >>>>> ranking in >>>>> >> data_description.txt, >>>>> >> rather than the happenstance ordering in train.csv; >>>>> >> though get_dummies should make that irrelevant. >>>>> >> >>>>> >> https://github.com/westurner/house_prices/blob/2839ff8a/hous >>>>> >> e_prices/data.py#L45 >>>>> >> >>>>> >> I should probably also manually specify that 'Id' is the index >>>>> column in >>>>> >> pd.read_csv (assuming there are no duplicates, which pandas should >>>>> check >>>>> >> for). >>>>> >> >>>>> >> >>>>> >>> When I needed to fit the model, I just "unraveled" the combined >>>>> set with >>>>> >>> the train and test parts. >>>>> >>> >>>>> >>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'], >>>>> >>> test.loc[:,'MSSubClass':'SaleCondition'])) >>>>> >>> >>>>> >>> combined = pd.get_dummies(combined) >>>>> >>> >>>>> >>> ::: do some feature engineering ::: >>>>> >>> >>>>> >>> trainX = combined[:train.shape[0]] >>>>> >>> y = train['SalePrice'] >>>>> >>> >>>>> >>> Just so long you don't do anything to the combined dataframe (like >>>>> >>> sorting), you can slice off each part based on it's shape. >>>>> >>> >>>>> >> >>>>> >> http://pandas.pydata.org/pandas-docs/stable/indexing.html# >>>>> >> returning-a-view-versus-a-copy >>>>> >> >>>>> >> >>>>> >>> >>>>> >>> and when you would be pulling the data to predict the test data, >>>>> you get >>>>> >>> the other part: >>>>> >>> >>>>> >>> testX = combined[train.shape[0]:] >>>>> >>> >>>>> >> >>>>> >> Why is the concatenation necessary? >>>>> >> - log1p doesn't need the whole column >>>>> >> - get_dummies doesn't need the whole column >>>>> >> >>>>> > >>>>> http://scikit-learn.org/stable/modules/generated/sklearn.pre >>>>> processing.StandardScaler.html >>>>> requires the whole column. 
>>>>> >>>>> ( >>>>> http://scikit-learn.org/stable/modules/preprocessing.html#pr >>>>> eprocessing-scaler >>>>> ) >>>>> >>>>> >>>>> >>>>> >>>>> > >>>>> >> >>>>> >>> >>>>> >>> >>>>> >>> Luke >>>>> >>> >>>>> >>> >>>>> >>> >>>>> >> >>>>> > (Trimmed reply-chain (again) because 40Kb limit) >>>>> > >>>>> > >>>>> _______________________________________________ >>>>> Omaha Python Users Group mailing list >>>>> Omaha at python.org >>>>> https://mail.python.org/mailman/listinfo/omaha >>>>> http://www.OmahaPython.org >>>>> >>>> >>>> >>> >> From uiab1638 at yahoo.com Wed Dec 28 01:56:21 2016 From: uiab1638 at yahoo.com (Jeremy Doyle) Date: Wed, 28 Dec 2016 00:56:21 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our score! Currently sitting at 798th place. My notebook is on GitHub for those interested: https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4 Jeremy > On Dec 25, 2016, at 9:41 PM, Wes Turner via Omaha wrote: > >> On Sun, Dec 25, 2016 at 7:40 PM, Wes Turner wrote: >> >> >> >>> On Sunday, December 25, 2016, Bob Haffner wrote: >>> >>> Merry Christmas, everyone! >>> >>> >> Merry Christmas! >> >>> >>> Still heading down the TPOT path with limited success. I get varying >>> scores (tpot.score()) with the same result (kaggle scoring) >>> >>> Any other TPOT users getting inconsistent results? Specifically with >>> 0.6.7? >>> >> >> There may be variance because of the way TPOT splits X_train into X_train >> and X_test w/ train_size and test_size. >> >> I rewrote load_house_prices as a class w/ a better mccabe cyclomatic >> complexity score with a concatenation step so that X_train and X_test have >> the same columns (in data.py) >> >> It probably makes sense to use scikit-learn for data transformation (e.g. >> OneHotEncoder instead of get_dummies). >> >> https://twitter.com/westurner/status/813011289475842048 : >> """ >> . at scikit_learn >> Src: https://t.co/biMt6XRt2T >> Docs: https://t.co/Lb5EYRCdI8 >> #API: >> .fit_transform(X, y) >> .fit(X_train, y_train) >> .predict(X_test) >> """ >> >> I haven't yet run w/ pd.get_dummies and df.to_sparse; that may or may not >> prevent the oom error. >> >> Looking at https://libraries.io/pypi/xgboost "Dependent Repositories", >> there are a number of scikit-learn-compatible packages for automating >> analysis in addition to TPOT: auto-sklearn, rep. >> auto_ml mentions 12 algos for type_of_estimator='regressor'. >> (and sparse matrices, and other parameters). >> >> https://github.com/ClimbsRocks/auto_ml >> >> http://auto-ml.readthedocs.io/en/latest/ >> > > Here's a (probably overfitted) auto_ml attempt: > https://github.com/westurner/house_prices/blob/7260ada0c10cf371b33973b0d9af6bca860d0008/house_prices/analysis_auto_ml.py > > https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3958857 > ..."Your submission scored 9.45422, which is not an improvement of your > best score. " > > Setting .train(compute_power=10) errored out after a bunch of GridSearchCV. > > >> >> >> I should be able to generate column_descriptions from parse_description in >> data.py: >> https://github.com/westurner/house_prices/blob/develop/ >> house_prices/data.py >> >> https://github.com/automl/auto-sklearn looks cool too. >> >> ... 
http://stats.stackexchange.com/questions/181/how-to-choose-the-number- >> of-hidden-layers-and-nodes-in-a-feedforward-neural-netw >> >> http://tflearn.org >> >> >>> >>> >>> On Fri, Dec 23, 2016 at 8:03 AM, Luke Schollmeyer < >>> luke.schollmeyer at gmail.com> wrote: >>> >>>> Moved the needle a little bit yesterday with a ridge regression attempt >>>> using the same feature engineering I described before. >>>> >>>> Luke >>>> >>>> On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner >>>> wrote: >>>> >>>>> Made a TPOT attempt tonight. Could only do some numeric features >>>>> though because including categoricals would cause my ipython kernel to die. >>>>> >>>>> I will try a bigger box this weekend >>>>> >>>>> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha >>>>> wrote: >>>>> >>>>>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner >>>>>> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer < >>>>>>>> luke.schollmeyer at gmail.com> wrote: >>>>>>>> >>>>>>>>> The quick explanation is rather than dropping outliers, I used >>>>>> numpy's >>>>>>>>> log1p function to help normalize distribution of the data (for >>>>>> both the >>>>>>>>> sale price and for all features over a certain skewness). I was >>>>>> also >>>>>>>>> struggling with adding in more features to the model. >>>>>>>>> >>>>>>>> >>>>>>>> https://docs.scipy.org/doc/numpy/reference/generated/numpy.l >>>>>> og1p.html >>>>>>>> - http://scikit-learn.org/stable/modules/generated/sklearn. >>>>>>>> preprocessing.FunctionTransformer.html >>>>>>>> >>>>>>>> >>>>>>>> https://en.wikipedia.org/wiki/Data_transformation_(statistic >>>>>>>> s)#Common_transformations >>>>>>>> >>>>>>>> https://en.wikipedia.org/wiki/Log-normal_distribution >>>>>>>> >>>>>>>> >>>>>>>> How did you determine the skewness threshold? >>>>>>>> >>>>>>>> ... >>>>>>>> >>>>>>>> https://en.wikipedia.org/wiki/Maximum_entropy_probability_di >>>>>>>> stribution#Specified_variance:_the_normal_distribution >>>>>>>> >>>>>>>> https://en.wikipedia.org/wiki/Normalization_(statistics) >>>>>>>> >>>>>>>> http://scikit-learn.org/stable/modules/preprocessing.html#no >>>>>> rmalization >>>>>>>> >>>>>>> >>>>>>> - https://stackoverflow.com/questions/4674623/why-do-we- >>>>>>> have-to-normalize-the-input-for-an-artificial-neural-network >>>>>>> - https://stats.stackexchange.com/questions/7757/data-normaliz >>>>>> ation-and- >>>>>>> standardization-in-neural-networks >>>>>>> >>>>>> >>>>>> https://github.com/tensorflow/tensorflow/tree/master/tensorf >>>>>> low/contrib/learn/python/learn >>>>>> >>>>>> >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> The training and test data sets have different "completeness" of >>>>>> some >>>>>>>>> features, and using pd.get_dummies can be problematic when you fit >>>>>> a model >>>>>>>>> versus predicting if you don't have the same columns/features. I >>>>>> simply >>>>>>>>> combined the train and test data sets (without the Id and >>>>>> SalePrice) and >>>>>>>>> ran the get_dummies function over that set. 
>>>>>>>>> >>>>>>>> autoclean_cv loads the train set first and then applies those >>>>>>>> categorical/numerical mappings to the test set >>>>>>>> https://github.com/rhiever/datacleaner#datacleaner-in-scripts >>>>>>>> >>>>>>>> When I modify load_house_prices [1] to also load test.csv in order to >>>>>>>> autoclean_csv, I might try assigning the categorical levels according to the >>>>>>>> ranking in data_description.txt, rather than the happenstance ordering in >>>>>>>> train.csv; though get_dummies should make that irrelevant. >>>>>>>> >>>>>>>> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L45 >>>>>>>> >>>>>>>> I should probably also manually specify that 'Id' is the index column in >>>>>>>> pd.read_csv (assuming there are no duplicates, which pandas should check for). >>>>>>>> >>>>>>>>> When I needed to fit the model, I just "unraveled" the combined set with >>>>>>>>> the train and test parts. >>>>>>>>> >>>>>>>>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'], >>>>>>>>> test.loc[:,'MSSubClass':'SaleCondition'])) >>>>>>>>> >>>>>>>>> combined = pd.get_dummies(combined) >>>>>>>>> >>>>>>>>> ::: do some feature engineering ::: >>>>>>>>> >>>>>>>>> trainX = combined[:train.shape[0]] >>>>>>>>> y = train['SalePrice'] >>>>>>>>> >>>>>>>>> Just so long you don't do anything to the combined dataframe (like >>>>>>>>> sorting), you can slice off each part based on it's shape. >>>>>>>> >>>>>>>> http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy >>>>>>>> >>>>>>>>> and when you would be pulling the data to predict the test data, you get >>>>>>>>> the other part: >>>>>>>>> >>>>>>>>> testX = combined[train.shape[0]:] >>>>>>>> >>>>>>>> Why is the concatenation necessary? >>>>>>>> - log1p doesn't need the whole column >>>>>>>> - get_dummies doesn't need the whole column >>>>>> >>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html >>>>>> requires the whole column. >>>>>> ( http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler ) >>>>>>>>> >>>>>>>>> Luke >>>>>>> (Trimmed reply-chain (again) because 40Kb limit)
>>>>>> _______________________________________________ >>>>>> Omaha Python Users Group mailing list >>>>>> Omaha at python.org >>>>>> https://mail.python.org/mailman/listinfo/omaha >>>>>> http://www.OmahaPython.org > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org
From bob.haffner at gmail.com Wed Dec 28 08:21:03 2016 From: bob.haffner at gmail.com (Bob Haffner) Date: Wed, 28 Dec 2016 07:21:03 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: Wes, Yeah, I should try setting the random_state param and see if I still get the wide variance I've been seeing. I'll also check out the OneHotEncoder Bob
On Sun, Dec 25, 2016 at 7:40 PM, Wes Turner wrote: > On Sunday, December 25, 2016, Bob Haffner wrote: > >> Merry Christmas, everyone! >> > Merry Christmas! > >> Still heading down the TPOT path with limited success. I get varying >> scores (tpot.score()) with the same result (kaggle scoring) >> >> Any other TPOT users getting inconsistent results? Specifically with >> 0.6.7? >> > There may be variance because of the way TPOT splits X_train into X_train > and X_test w/ train_size and test_size. > > I rewrote load_house_prices as a class w/ a better mccabe cyclomatic > complexity score with a concatenation step so that X_train and X_test have > the same columns (in data.py) > > It probably makes sense to use scikit-learn for data transformation (e.g. > OneHotEncoder instead of get_dummies). > > https://twitter.com/westurner/status/813011289475842048 : > """ > . at scikit_learn > Src: https://t.co/biMt6XRt2T > Docs: https://t.co/Lb5EYRCdI8 > #API: > .fit_transform(X, y) > .fit(X_train, y_train) > .predict(X_test) > """ > > I haven't yet run w/ pd.get_dummies and df.to_sparse; that may or may not > prevent the oom error. > > Looking at https://libraries.io/pypi/xgboost "Dependent Repositories", > there are a number of scikit-learn-compatible packages for automating > analysis in addition to TPOT: auto-sklearn, rep. > auto_ml mentions 12 algos for type_of_estimator='regressor'. > (and sparse matrices, and other parameters).
> > https://github.com/ClimbsRocks/auto_ml > > http://auto-ml.readthedocs.io/en/latest/ > > I should be able to generate column_descriptions from parse_description in > data.py: > https://github.com/westurner/house_prices/blob/develop/ > house_prices/data.py > > https://github.com/automl/auto-sklearn looks cool too. > > ... http://stats.stackexchange.com/questions/181/how-to-choose-the-number- > of-hidden-layers-and-nodes-in-a-feedforward-neural-netw > > http://tflearn.org > > >> >> >> On Fri, Dec 23, 2016 at 8:03 AM, Luke Schollmeyer < >> luke.schollmeyer at gmail.com> wrote: >> >>> Moved the needle a little bit yesterday with a ridge regression attempt >>> using the same feature engineering I described before. >>> >>> Luke >>> >>> On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner >>> wrote: >>> >>>> Made a TPOT attempt tonight. Could only do some numeric features >>>> though because including categoricals would cause my ipython kernel to die. >>>> >>>> I will try a bigger box this weekend >>>> >>>> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha >>> > wrote: >>>> >>>>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner >>>>> wrote: >>>>> >>>>> > >>>>> > >>>>> > On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner >>>>> wrote: >>>>> > >>>>> >> >>>>> >> >>>>> >> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer < >>>>> >> luke.schollmeyer at gmail.com> wrote: >>>>> >> >>>>> >>> The quick explanation is rather than dropping outliers, I used >>>>> numpy's >>>>> >>> log1p function to help normalize distribution of the data (for >>>>> both the >>>>> >>> sale price and for all features over a certain skewness). I was >>>>> also >>>>> >>> struggling with adding in more features to the model. >>>>> >>> >>>>> >> >>>>> >> https://docs.scipy.org/doc/numpy/reference/generated/numpy.l >>>>> og1p.html >>>>> >> - http://scikit-learn.org/stable/modules/generated/sklearn. >>>>> >> preprocessing.FunctionTransformer.html >>>>> >> >>>>> >> >>>>> >> https://en.wikipedia.org/wiki/Data_transformation_(statistic >>>>> >> s)#Common_transformations >>>>> >> >>>>> >> https://en.wikipedia.org/wiki/Log-normal_distribution >>>>> >> >>>>> >> >>>>> >> How did you determine the skewness threshold? >>>>> >> >>>>> >> ... >>>>> >> >>>>> >> https://en.wikipedia.org/wiki/Maximum_entropy_probability_di >>>>> >> stribution#Specified_variance:_the_normal_distribution >>>>> >> >>>>> >> https://en.wikipedia.org/wiki/Normalization_(statistics) >>>>> >> >>>>> >> http://scikit-learn.org/stable/modules/preprocessing.html#no >>>>> rmalization >>>>> >> >>>>> > >>>>> > - https://stackoverflow.com/questions/4674623/why-do-we- >>>>> > have-to-normalize-the-input-for-an-artificial-neural-network >>>>> > - https://stats.stackexchange.com/questions/7757/data-normaliz >>>>> ation-and- >>>>> > standardization-in-neural-networks >>>>> > >>>>> >>>>> https://github.com/tensorflow/tensorflow/tree/master/tensorf >>>>> low/contrib/learn/python/learn >>>>> >>>>> >>>>> > >>>>> > >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> >>> The training and test data sets have different "completeness" of >>>>> some >>>>> >>> features, and using pd.get_dummies can be problematic when you fit >>>>> a model >>>>> >>> versus predicting if you don't have the same columns/features. I >>>>> simply >>>>> >>> combined the train and test data sets (without the Id and >>>>> SalePrice) and >>>>> >>> ran the get_dummies function over that set. 
>>>>> >>> >>>>> >> >>>>> >> autoclean_cv loads the train set first and then applies those >>>>> >> categorical/numerical mappings to the test set >>>>> >> https://github.com/rhiever/datacleaner#datacleaner-in-scripts >>>>> >> >>>>> >> When I modify load_house_prices [1] to also load test.csv in order >>>>> to >>>>> >> autoclean_csv, >>>>> >> I might try assigning the categorical levels according to the >>>>> ranking in >>>>> >> data_description.txt, >>>>> >> rather than the happenstance ordering in train.csv; >>>>> >> though get_dummies should make that irrelevant. >>>>> >> >>>>> >> https://github.com/westurner/house_prices/blob/2839ff8a/hous >>>>> >> e_prices/data.py#L45 >>>>> >> >>>>> >> I should probably also manually specify that 'Id' is the index >>>>> column in >>>>> >> pd.read_csv (assuming there are no duplicates, which pandas should >>>>> check >>>>> >> for). >>>>> >> >>>>> >> >>>>> >>> When I needed to fit the model, I just "unraveled" the combined >>>>> set with >>>>> >>> the train and test parts. >>>>> >>> >>>>> >>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'], >>>>> >>> test.loc[:,'MSSubClass':'SaleCondition'])) >>>>> >>> >>>>> >>> combined = pd.get_dummies(combined) >>>>> >>> >>>>> >>> ::: do some feature engineering ::: >>>>> >>> >>>>> >>> trainX = combined[:train.shape[0]] >>>>> >>> y = train['SalePrice'] >>>>> >>> >>>>> >>> Just so long you don't do anything to the combined dataframe (like >>>>> >>> sorting), you can slice off each part based on it's shape. >>>>> >>> >>>>> >> >>>>> >> http://pandas.pydata.org/pandas-docs/stable/indexing.html# >>>>> >> returning-a-view-versus-a-copy >>>>> >> >>>>> >> >>>>> >>> >>>>> >>> and when you would be pulling the data to predict the test data, >>>>> you get >>>>> >>> the other part: >>>>> >>> >>>>> >>> testX = combined[train.shape[0]:] >>>>> >>> >>>>> >> >>>>> >> Why is the concatenation necessary? >>>>> >> - log1p doesn't need the whole column >>>>> >> - get_dummies doesn't need the whole column >>>>> >> >>>>> > >>>>> http://scikit-learn.org/stable/modules/generated/sklearn.pre >>>>> processing.StandardScaler.html >>>>> requires the whole column. >>>>> >>>>> ( >>>>> http://scikit-learn.org/stable/modules/preprocessing.html#pr >>>>> eprocessing-scaler >>>>> ) >>>>> >>>>> >>>>> >>>>> >>>>> > >>>>> >> >>>>> >>> >>>>> >>> >>>>> >>> Luke >>>>> >>> >>>>> >>> >>>>> >>> >>>>> >> >>>>> > (Trimmed reply-chain (again) because 40Kb limit) >>>>> > >>>>> > >>>>> _______________________________________________ >>>>> Omaha Python Users Group mailing list >>>>> Omaha at python.org >>>>> https://mail.python.org/mailman/listinfo/omaha >>>>> http://www.OmahaPython.org >>>>> >>>> >>>> >>> >> From bob.haffner at gmail.com Wed Dec 28 08:22:39 2016 From: bob.haffner at gmail.com (Bob Haffner) Date: Wed, 28 Dec 2016 07:22:39 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> Message-ID: Nice job, Jeremy! We're in the triple digits!! On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha wrote: > Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our > score! Currently sitting at 798th place. 
From bob.haffner at gmail.com  Wed Dec 28 08:22:39 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Wed, 28 Dec 2016 07:22:39 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Nice job, Jeremy!  We're in the triple digits!!

On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha
wrote:

> Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our
> score! Currently sitting at 798th place.
>
> My notebook is on GitHub for those interested:
>
> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4
>
> Jeremy
>
>
> > On Dec 25, 2016, at 9:41 PM, Wes Turner via Omaha
> wrote:
> >
> >> On Sun, Dec 25, 2016 at 7:40 PM, Wes Turner
> wrote:
> >>
> >>> On Sunday, December 25, 2016, Bob Haffner
> wrote:
> >>>
> >>> Merry Christmas, everyone!
> >>>
> >> Merry Christmas!
> >>
> >>> Still heading down the TPOT path with limited success.  I get varying
> >>> scores (tpot.score()) with the same result (kaggle scoring)
> >>>
> >>> Any other TPOT users getting inconsistent results?  Specifically with
> >>> 0.6.7?
> >>>
> >> There may be variance because of the way TPOT splits X_train into
> X_train
> >> and X_test w/ train_size and test_size.
> >>
> > (Trimmed quoted reply-chain)
>
> _______________________________________________
> Omaha Python Users Group mailing list
> Omaha at python.org
> https://mail.python.org/mailman/listinfo/omaha
> http://www.OmahaPython.org
>
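On the varying tpot.score() results discussed in the quoted thread: both
TPOT's internal train/test split and its evolutionary search are stochastic.
A sketch of one way to make runs repeatable, assuming X and y are the cleaned
feature matrix and SalePrice target, and that the installed 0.6.x release
exposes the random_state parameter:

    from sklearn.model_selection import train_test_split
    from tpot import TPOTRegressor

    # Hold out a validation split ourselves so the score is computed on
    # the same rows every run
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, train_size=0.75, random_state=0)

    # random_state should also pin TPOT's evolutionary search
    tpot = TPOTRegressor(generations=5, population_size=20,
                         random_state=0, verbosity=2)
    tpot.fit(X_train, y_train)
    print(tpot.score(X_val, y_val))
    tpot.export('tpot_house_prices_pipeline.py')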
From luke.schollmeyer at gmail.com  Wed Dec 28 09:53:30 2016
From: luke.schollmeyer at gmail.com (Luke Schollmeyer)
Date: Wed, 28 Dec 2016 08:53:30 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Nice feature engineering! It really gets into the "buyer mindset" to help
reward higher prices and punish lower prices. Good job using some external
data.

Luke

On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha
wrote:

> Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our
> score! Currently sitting at 798th place.
>
> My notebook is on GitHub for those interested:
>
> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4
>
> Jeremy
>
> (Trimmed quoted reply-chain)
> _______________________________________________
> Omaha Python Users Group mailing list
> Omaha at python.org
> https://mail.python.org/mailman/listinfo/omaha
> http://www.OmahaPython.org
>

From bob.haffner at gmail.com  Wed Dec 28 10:36:17 2016
From: bob.haffner at gmail.com (Bob Haffner)
Date: Wed, 28 Dec 2016 09:36:17 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Leaderboard update:
https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices_leaderboard.ipynb

On Wed, Dec 28, 2016 at 8:53 AM, Luke Schollmeyer via Omaha <
omaha at python.org> wrote:

> Nice feature engineering! It really gets into the "buyer mindset" to help
> reward higher prices and punish lower prices. Good job using some external
> data.
>
> Luke
>
> On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha
> > wrote:
>
> > Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our
> > score! Currently sitting at 798th place.
> >
> > (Trimmed quoted reply-chain)
> _______________________________________________
> Omaha Python Users Group mailing list
> Omaha at python.org
> https://mail.python.org/mailman/listinfo/omaha
> http://www.OmahaPython.org
>
From wes.turner at gmail.com  Wed Dec 28 13:01:19 2016
From: wes.turner at gmail.com (Wes Turner)
Date: Wed, 28 Dec 2016 12:01:19 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha
wrote:

> Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our
> score! Currently sitting at 798th place.
>

Nice work!

Features of your feature engineering I admire:

- nominal, ordinal, continuous, discrete
  categorical = nominal + discrete
  numeric = continuous + discrete
- outlier removal
  - [ ] w/ constant thresholding? (is there a distribution parameter)
- building datestrings from SaleMonth and YrSold
  - SaleMonth / "1" / YrSold
  - df.drop(['MoSold','YrSold','SaleMonth'])
  - [ ] why drop SaleMonth?
  - [ ] pandas.to_datetime(df['SaleMonth'])
- merging with FHA Home Price Index for the month and region ("West North
  Central")
  https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_PO_monthly_hist.xls
  - [ ] pandas.to_datetime
  - this should have every month, but the new merge_asof feature is worth
    mentioning (see the sketch after this list)
- manual binarization
  - [ ] how did you pick these? correlation after pd.get_dummies?
  - [ ] why floats? 1.0 / 1 (does it make a difference?)
- Ames, IA nbrhood_multiplier
  - http://www.cityofames.org/home/showdocument?id=1024
- feature merging
  - BsmtFinSF = BsmtFinSF1 + BsmtFinSF2
  - TotalBaths = BsmtFullBath + (BsmtHalfBath / 2.0) + FullBath +
    (HalfBath / 2.0)
  - ( ) IDK how a feature-selection pipeline could do this automatically
- null value imputation
  - .isnull() = 0
  - ( ) datacleaner incorrectly sets these to median or mode
- log for skewed continuous and SalePrice
  - ( ) auto_ml: take_log_of_y does this for SalePrice
- "Keeping only the columns we want"
  - [ ] 'Id' shouldn't be relevant (pd.read_csv(filename, index_col='Id'))
- Binarization
  - pd.get_dummies(dummy_na=False)
  - [ ] (as Luke pointed out, concatenation keeps the same columns)

    rows = eng_train.shape[0]
    eng_merged = pd.concat((eng_train, eng_test))
    onehot_merged = pd.get_dummies(eng_merged, columns=nominal,
                                   dummy_na=False)
    onehot_train = onehot_merged[:rows]
    onehot_test = onehot_merged[rows:]

- class RandomSelectionHelper
  - [ ] this could be generally helpful in sklearn[-pandas]
    - https://github.com/paulgb/sklearn-pandas#cross-validation
- Models to Search
  - {Ridge, Lasso, ElasticNet}
  - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L222
    _get_estimator_names ( "regressor" )
    - {XGBRegressor, GradientBoostingRegressor, RANSACRegressor,
      RandomForestRegressor, LinearRegression, AdaBoostRegressor,
      ExtraTreesRegressor}
  - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L491
    - (w/ ensembling)
    - ['RandomForestRegressor', 'LinearRegression', 'ExtraTreesRegressor',
      'Ridge', 'GradientBoostingRegressor', 'AdaBoostRegressor', 'Lasso',
      'ElasticNet', 'LassoLars', 'OrthogonalMatchingPursuit',
      'BayesianRidge', 'SGDRegressor'] + ['XGBRegressor']
- model stacking / ensembling
  - ( ) auto_ml: https://auto-ml.readthedocs.io/en/latest/ensembling.html
  - ( ) auto-sklearn:
    https://automl.github.io/auto-sklearn/stable/api.html#autosklearn.regression.AutoSklearnRegressor
    ensemble_size=50, ensemble_nbest=50
- submission['SalePrice'] = submission.SalePrice.apply(lambda x: np.exp(x))
  - [ ] What is this called / how does this work? (see the note below)
  - https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
- df.to_csv(filename, columns=['SalePrice'], index_label='Id') also works
  - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
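On the merge_asof point above, a sketch of joining each sale to the most
recent available HPI observation; the hpi frame and its column names are
assumptions for illustration, not the ones in Jeremy's notebook:

    import pandas as pd

    # hpi: monthly index values parsed from HPI_PO_monthly_hist.xls,
    # already filtered to the "West North Central" division, with a
    # datetime64 'month' column
    sales = sales.assign(
        sale_date=pd.to_datetime(sales.YrSold.astype(str) + '-'
                                 + sales.MoSold.astype(str) + '-1')
    ).sort_values('sale_date')
    hpi = hpi.sort_values('month')

    # merge_asof (pandas >= 0.19) takes the last HPI row whose month is
    # <= the sale date, so a missing month degrades gracefully
    sales = pd.merge_asof(sales, hpi, left_on='sale_date', right_on='month')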

> My notebook is on GitHub for those interested:
>
> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4


Thanks!


>
> Jeremy
>
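On the np.exp question above: that step is the inverse transform of the
log-target trick, mapping the model's log-scale predictions back to dollar
prices before submission. One detail worth checking: if the target was
transformed with np.log1p rather than np.log, the exact inverse is np.expm1.
A sketch (model and testX are hypothetical placeholders):

    import numpy as np

    y_log = np.log1p(train['SalePrice'])      # fit on log-scale prices
    # ... model.fit(trainX, y_log) ...
    preds = np.expm1(model.predict(testX))    # exact inverse of log1p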
> (Trimmed quoted reply-chain)
>
> _______________________________________________
> Omaha Python Users Group mailing list
> Omaha at python.org
> https://mail.python.org/mailman/listinfo/omaha
> http://www.OmahaPython.org
>