[omaha] Group Data Science Competition

Wes Turner wes.turner at gmail.com
Tue Feb 14 00:10:44 EST 2017


On Sun, Feb 12, 2017 at 6:07 PM, Wes Turner <wes.turner at gmail.com> wrote:

>
> On Saturday, February 11, 2017, Bob Haffner via Omaha <omaha at python.org>
> wrote:
>
>> Wes, I didn't check the MSE.  I need to though as my submission didn't
>> score well at all  :-)
>>
>> I used TensorFlow as the backend.   Also, I used the KerasRegressor model
>> so that made things pretty simple
>
>
> There are so many neural network / deep learning topologies:
> http://www.asimovinstitute.org/neural-network-zoo/
>
>   "A mostly complete chart of neural networks"
>   http://www.asimovinstitute.org/wp-content/uploads/2016/09/neuralnetworks.png
>
> ...
>
> How does changing the nb_epoch parameter affect the .score()?
> http://stackoverflow.com/questions/36936209/how-to-make-keras-neural-net-outperforming-logistic-regression-on-iris-data
>
> How does adding one or more layers affect the objective statistic?
>
> Does creating a datetime64 (or UNIX time) feature from the year and month
> improve the output? Or is it learnable from the separate fields? (a rough
> sketch follows below)
>
> http://blog.fastforwardlabs.com/2016/02/24/hello-world-in-keras-or-scikit-learn-versus.html
>
> How do Jeremy Doyle's extra data prep and merged home price index data
> affect this NN model?
> https://en.wikipedia.org/wiki/Data_wrangling#See_also
>
>
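Here's a rough sketch of deriving a single time feature from the year and
month fields (assuming the Ames YrSold/MoSold columns; whether it actually
helps the model is the open question above):

    import pandas as pd

    df = pd.DataFrame({'YrSold': [2008, 2009], 'MoSold': [3, 11]})
    # combine year + month into a datetime64 (day fixed to the 1st)
    df['date_sold'] = pd.to_datetime(
        df['YrSold'].astype(str) + '-' + df['MoSold'].astype(str) + '-01')
    # datetime64[ns] -> UNIX seconds
    df['time_sold'] = df['date_sold'].astype('int64') // 10**9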


> @jeremydoyle
> Could you wrap up your feature logic as an sklearn .transform()-able class?
>

http://scikit-learn.org/stable/data_transforms.html

https://stackoverflow.com/questions/25539311/custom-transformer-for-sklearn-pipeline-that-alters-both-x-and-y
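For example, a minimal sketch of wrapping feature logic as a
.transform()-able class (the wrapped function is hypothetical, not anyone's
actual feature code):

    from sklearn.base import BaseEstimator, TransformerMixin

    class FeatureLogic(BaseEstimator, TransformerMixin):
        """Wraps a plain feature-engineering function for use in a Pipeline."""

        def __init__(self, func):
            self.func = func  # any callable: DataFrame in, DataFrame out

        def fit(self, X, y=None):
            return self  # stateless; nothing is learned from training data

        def transform(self, X):
            return self.func(X)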

- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
- https://github.com/scikit-learn/scikit-learn/issues/3855 "Resampler
  estimators that change the sample size in fitting"
  - according to this, in order to integrate our team's various models'
features with scikit-learn, we should have a separate preprocessing
pipeline (with fit, transform, fit_transform), followed by the fit() and
predict() pipeline, if we modify y. To avoid dropping NaN/NULLs, we can
instead impute using various strategies:

- https://github.com/rhiever/datacleaner/issues/1#issuecomment-279607720
re: imputation w/ scikit-learn

* https://en.wikipedia.org/wiki/Imputation_(statistics) :

  > In statistics, imputation is the process of replacing missing data with
substituted values. When substituting for a data point, it is known as
"unit imputation"; when substituting for a component of a data point, it is
known as "item imputation". **There are three main problems that missing
data causes: *missing data can introduce a substantial amount of bias*,
make the handling and analysis of the data more arduous, and create
reductions in efficiency.**[1] Because missing data can create problems for
analyzing data, imputation is seen as a way to avoid pitfalls involved with
listwise deletion of cases that have missing values. That is to say, when
one or more values are missing for a case, most statistical packages
default to discarding any case that has a missing value, which may
introduce bias or affect the representativeness of the results. Imputation
preserves all cases by replacing missing data with an estimated value based
on other available information. Once all missing values have been imputed,
the data set can then be analysed using standard techniques for complete
data.[2] Imputation theory is constantly developing and thus requires
consistent attention to new information regarding the subject. There have
been many theories embraced by scientists to account for missing data but
the majority of them introduce large amounts of bias. A few of the well
known attempts to deal with missing data include: **hot deck and cold deck
imputation; listwise and pairwise deletion; mean imputation; regression
imputation; last observation carried forward; stochastic imputation; and
multiple imputation.** [emphasis added]

* http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values
* http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
* http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
  * class sklearn.preprocessing.Imputer(..., strategy='mean'|'median'|'most_frequent', ...)

* http://scikit-learn.org/stable/auto_examples/missing_values.html :
  > **Imputing missing values before building an estimator**
  > This example shows that imputing the missing values can give better
  > results than discarding the samples containing any missing value.
  > Imputing does not always improve the predictions, so please check via
  > cross-validation. Sometimes dropping rows or using marker values is more
  > effective.
  > Missing values can be replaced by the mean, the median or the most
  > frequent value using the strategy hyper-parameter. The median is a more
  > robust estimator for data with high magnitude variables which could
  > dominate results (otherwise known as a ‘long tail’).
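A minimal sketch with the Imputer API linked above:

    import numpy as np
    from sklearn.preprocessing import Imputer

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, np.nan]])
    imputer = Imputer(strategy='median')  # or 'mean' / 'most_frequent'
    X_imputed = imputer.fit_transform(X)  # NaNs filled per column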



>
> It's possible to import *named functions* and/or data from .ipynb
> notebooks with pypi:ipynb.
>

ipynb - Package / Module importer for importing code from Jupyter Notebook
files (.ipynb)
| Src: https://github.com/ipython/ipynb
| Docs: https://ipynb.readthedocs.io/en/latest/
| PyPI: https://pypi.python.org/pypi/ipynb
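A sketch of the import pattern (the notebook name is hypothetical; per the
docs, ipynb.fs.full executes the whole notebook, while ipynb.fs.defs imports
only the definitions):

    # import from ./my_notebook.ipynb, running all of its cells
    from ipynb.fs.full.my_notebook import prepare_features

    # or import only function/class definitions (skips top-level side effects)
    from ipynb.fs.defs.my_notebook import prepare_features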


>
> If we can agree on a functional interface (sklearn
> transform[/fit/predict/]), we could mix and match sklearn pipeline
> functions in our omahapython/kaggle-houseprices repo!:
> https://github.com/omahapython/kaggle-houseprices/tree/master/src
>
> For lack of a better namespace scheme for module import,
> I added my repo as a git submodule prefixed with my username:
> src/westurner_house_prices -> gh:westurner/kaggle-houseprices ~master
>
>
> (
> Btw,
> - I just summarized the answers to "Using IPython notebooks under version
> control":
> http://stackoverflow.com/questions/18734739/using-ipython-notebooks-under-version-control/42128373#42128373
>   - nbdime contains {nbdiff, nbmerge}
>   - auto-save .ipynb as .py with pre/post_save()
>   - `nbconvert --to python` on_commit()
>   - auto-create a .clean.ipynb (or .strippedoutput.ipynb) with e.g.
> nbstripout
> ...
> - I like runipy because it re-numbers all the cells and prints to
> stdout/stderr
> )
>
> In order to cross over, mutate, and improve our score, I think we should
> compose a few scikit-learn pipelines that combine our group's efforts.
>



>
> - [ ] git submodules in src/
>

https://github.com/blog/2104-working-with-submodules


> - [ ] importable *named functions* w/o side effects
>

see FunctionTransformer (link above)
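For instance, a named function with no side effects drops straight into a
pipeline via FunctionTransformer (a sketch; log1p is just a stand-in
transform):

    import numpy as np
    from sklearn.preprocessing import FunctionTransformer

    def log1p_features(X):
        # pure function: returns a new array, mutates nothing
        return np.log1p(X)

    log_transformer = FunctionTransformer(log1p_features)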


> - [ ] DOC: :returns: pd.DataFrame, np., sklearn API
> - [ ] a standard local model scoring function (because what does the
> kaggle score even mean)
>   - real dollars deviance: sum(abs(residuals))
>


>
>
- MSE sum(residuals**2)
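A sketch of such a local scoring function (note: the Kaggle leaderboard for
this competition reportedly scores RMSE on log(SalePrice), so a log-space
metric is included too):

    import numpy as np

    def real_dollar_deviance(y_true, y_pred):
        """Total absolute error in dollars: sum(abs(residuals))."""
        return np.sum(np.abs(y_true - y_pred))

    def mse(y_true, y_pred):
        """Mean squared error: mean(residuals**2)."""
        return np.mean((y_true - y_pred) ** 2)

    def rmse_log(y_true, y_pred):
        """RMSE of log-prices (approximately the leaderboard metric)."""
        return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))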


> sklearn API
> http://scikit-learn.org/stable/modules/pipeline.html :
>
> """
>
> Pipeline
> <http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline> can
> be used to chain multiple estimators into one. This is useful as there is
> often a fixed sequence of steps in processing the data, for example feature
> selection, normalization and classification. Pipeline
> <http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline>
>  serves two purposes here:
>
> *Convenience*: You only have to call fit and predict once on your data to
> fit a whole sequence of estimators.
>
> *Joint parameter selection*: You can grid search
> <http://scikit-learn.org/stable/modules/grid_search.html#grid-search> over
> parameters of all estimators in the pipeline at once.
>
> All estimators in a pipeline, except the last one, must be transformers
> (i.e. must have a transform method). The last estimator may be any type
> (transformer, classifier, etc.).
> """
>

sklearn.base.TransformerMixin
http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html#sklearn.base.TransformerMixin

sklearn.preprocessing.FunctionTransformer
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
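A sketch tying the pieces above together (the estimator choices are
placeholders, not a recommendation):

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Imputer, StandardScaler
    from sklearn.linear_model import LinearRegression

    pipeline = Pipeline([
        ('impute', Imputer(strategy='median')),  # fill NaNs (see above)
        ('scale', StandardScaler()),             # transformer: fit/transform
        ('model', LinearRegression()),           # final estimator: fit/predict
    ])
    # pipeline.fit(X_train, y_train); pipeline.predict(X_test)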


>
>
> https://keras.io/scikit-learn-api/
> - grid search for hyperparameters
> - KerasRegressor works as a sklearn pipeline step
>
>
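A sketch of KerasRegressor as a grid-searchable sklearn estimator (Keras
1.x-era parameter names; the layer sizes are arbitrary, and NUM_FEATURES is
an assumption that must match X.shape[1]):

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.wrappers.scikit_learn import KerasRegressor
    from sklearn.model_selection import GridSearchCV

    NUM_FEATURES = 36  # assumption: set to X.shape[1]

    def build_model():
        model = Sequential()
        model.add(Dense(64, input_dim=NUM_FEATURES, activation='relu'))
        model.add(Dense(1))
        model.compile(loss='mean_squared_error', optimizer='adam')
        return model

    estimator = KerasRegressor(build_fn=build_model, batch_size=32, verbose=0)
    grid = GridSearchCV(estimator, param_grid={'nb_epoch': [50, 100, 200]})
    # grid.fit(X, y); grid.best_params_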
>> On Fri, Feb 10, 2017 at 10:13 AM, Wes Turner <wes.turner at gmail.com>
>> wrote:
>>
>> > Anyone have a good way to re.findall() the links in this thread?
>> >
>> > - [ ] linkgrep(thread) >> wiki
>> > - [ ] http://www.datatau.com is like news.ycombinator.com for Data
>> Science
>> >
>> > @bob
>> > Cool.
>> > What is the (MSE, $ deviance) with Keras?
>> >
>> > Keras with Theano or TensorFlow?
>> >
>> > On Thursday, February 9, 2017, Bob Haffner via Omaha <omaha at python.org>
>> > wrote:
>> >
>> >> Added a Deep Learning section to my notebook
>> >> https://github.com/bobhaffner/kaggle-houseprices/blob/master/kaggle_house_prices.ipynb
>> >>
>> >> Using Keras for the modeling with TensorFlow as the backend.
>> >>
>> >> I've generated a submission, but I don't know how it performed as
>> Kaggle
>> >> seems to be on the fritz tonight.
>> >>
>> >> On Sat, Jan 14, 2017 at 12:52 AM, Wes Turner <wes.turner at gmail.com>
>> >> wrote:
>> >>
>> >> >
>> >> >
>> > (... Trimmed )
>
>

