[omaha] Group Data Science Competition

Wes Turner wes.turner at gmail.com
Sun Dec 25 20:40:54 EST 2016


On Sunday, December 25, 2016, Bob Haffner <bob.haffner at gmail.com> wrote:

> Merry Christmas, everyone!
>
>
Merry Christmas!

>
>  Still heading down the TPOT path with limited success.  I get varying
> scores (tpot.score()) with the same result (kaggle scoring)
>
> Any other TPOT users getting inconsistent results?   Specifically with
> 0.6.7?
>

There may be variance because of the way TPOT splits X_train into X_train
and X_test w/ train_size and test_size.
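
If the variance is coming from that split (and from TPOT's own stochastic
search), pinning random_state in both places should make runs repeatable.
A minimal sketch, assuming TPOTRegressor's scikit-learn-style API and that
X / y are the feature matrix and SalePrice target:

from sklearn.model_selection import train_test_split
from tpot import TPOTRegressor

# pin the seed for both the train/test split and TPOT's genetic search
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=0)
tpot = TPOTRegressor(generations=5, population_size=20,
                     random_state=0, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))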

I rewrote load_house_prices (in data.py) as a class w/ a better McCabe
cyclomatic complexity score and a concatenation step so that X_train and
X_test have the same columns.
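
Roughly (a sketch, assuming the Kaggle train.csv/test.csv layout with 'Id'
as the index and SalePrice only in train):

import pandas as pd

train = pd.read_csv('train.csv', index_col='Id')
test = pd.read_csv('test.csv', index_col='Id')

# concatenate before get_dummies so both halves end up with the same
# dummy columns, then split back by row count
combined = pd.concat([train.drop('SalePrice', axis=1), test])
combined = pd.get_dummies(combined)

X_train = combined[:len(train)]
X_test = combined[len(train):]
y_train = train['SalePrice']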

It probably makes sense to use scikit-learn for data transformation (e.g.
OneHotEncoder instead of get_dummies).
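
Something like this (a sketch; it assumes a scikit-learn version whose
OneHotEncoder accepts string-valued columns -- older releases need the
categories integer-encoded first -- and the raw train/test frames above):

from sklearn.preprocessing import OneHotEncoder

cat_cols = train.select_dtypes(include=['object']).columns
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[cat_cols])                     # learn category levels from train only
train_cats = encoder.transform(train[cat_cols])  # scipy sparse matrix
test_cats = encoder.transform(test[cat_cols])    # unseen test levels -> all zeros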

https://twitter.com/westurner/status/813011289475842048 :
"""
. at scikit_learn
Src: https://t.co/biMt6XRt2T
Docs: https://t.co/Lb5EYRCdI8
#API:
.fit_transform(X, y)
.fit(X_train, y_train)
.predict(X_test)
"""

I haven't yet run w/ pd.get_dummies and df.to_sparse; that may or may not
prevent the out-of-memory (OOM) error.
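
The two options, as a sketch (sparse= is a pd.get_dummies keyword and
.to_sparse() converts a dense DataFrame; whether the downstream estimators
accept the result without densifying it again is the open question):

import pandas as pd

# sparse dummy columns instead of one dense column per category level
combined = pd.get_dummies(combined, sparse=True)

# or, convert an already-built dense frame after the fact
# combined = combined.to_sparse()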

Looking at the "Dependent Repositories" list at
https://libraries.io/pypi/xgboost, there are a number of
scikit-learn-compatible packages for automating analysis in addition to
TPOT: auto-sklearn, rep, auto_ml.
auto_ml mentions 12 algorithms for type_of_estimator='regressor'
(plus sparse matrices and other parameters).

https://github.com/ClimbsRocks/auto_ml

http://auto-ml.readthedocs.io/en/latest/

I should be able to generate column_descriptions from parse_description in
data.py:
https://github.com/westurner/house_prices/blob/develop/house_prices/data.py
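
Roughly what that could look like with auto_ml's documented Predictor API
(a sketch; categorical_columns standing in for whatever parse_description
returns is hypothetical):

from auto_ml import Predictor

# hypothetical: column roles derived from parse_description()'s output
column_descriptions = {'SalePrice': 'output'}
column_descriptions.update(
    {name: 'categorical' for name in categorical_columns})

ml_predictor = Predictor(type_of_estimator='regressor',
                         column_descriptions=column_descriptions)
ml_predictor.train(df_train)
predictions = ml_predictor.predict(df_test)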

https://github.com/automl/auto-sklearn looks cool too.

...
http://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

http://tflearn.org
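
If it comes to a neural net, a minimal TFLearn regressor could look like
this (a sketch; the two 64-unit hidden layers are arbitrary and are exactly
the kind of hyperparameter the stats.stackexchange thread above is about):

import tflearn

# X: 2-D float feature array; Y: column vector of (log-transformed) prices
net = tflearn.input_data(shape=[None, X.shape[1]])
net = tflearn.fully_connected(net, 64, activation='relu')
net = tflearn.fully_connected(net, 64, activation='relu')
net = tflearn.fully_connected(net, 1, activation='linear')
net = tflearn.regression(net, optimizer='adam', loss='mean_square')

model = tflearn.DNN(net)
model.fit(X, Y, n_epoch=20, validation_set=0.1)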


>
>
> On Fri, Dec 23, 2016 at 8:03 AM, Luke Schollmeyer <
> luke.schollmeyer at gmail.com> wrote:
>
>> Moved the needle a little bit yesterday with a ridge regression attempt
>> using the same feature engineering I described before.
>>
>> Luke
>>
>> On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner <bob.haffner at gmail.com> wrote:
>>
>>> Made a TPOT attempt tonight.  Could only do some numeric features though
>>> because including categoricals would cause my ipython kernel to die.
>>>
>>> I will try a bigger box this weekend
>>>
>>> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha <omaha at python.org> wrote:
>>>
>>>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner <wes.turner at gmail.com> wrote:
>>>>
>>>> >
>>>> >
>>>> > On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner <wes.turner at gmail.com> wrote:
>>>> >
>>>> >>
>>>> >>
>>>> >> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer <
>>>> >> luke.schollmeyer at gmail.com> wrote:
>>>> >>
>>>> >>> The quick explanation is rather than dropping outliers, I used numpy's
>>>> >>> log1p function to help normalize distribution of the data (for both the
>>>> >>> sale price and for all features over a certain skewness). I was also
>>>> >>> struggling with adding in more features to the model.
>>>> >>>
>>>> >>
>>>> >> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html
>>>> >> - http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
>>>> >>
>>>> >>
>>>> >> https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations
>>>> >>
>>>> >> https://en.wikipedia.org/wiki/Log-normal_distribution
>>>> >>
>>>> >>
>>>> >> How did you determine the skewness threshold?
>>>> >>
>>>> >> ...
>>>> >>
>>>> >> https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Specified_variance:_the_normal_distribution
>>>> >>
>>>> >> https://en.wikipedia.org/wiki/Normalization_(statistics)
>>>> >>
>>>> >> http://scikit-learn.org/stable/modules/preprocessing.html#normalization
>>>> >>
>>>> >
>>>> > - https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network
>>>> > - https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
>>>> >
>>>>
>>>> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn
>>>>
>>>>
>>>> >
>>>> >
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>> The training and test data sets have different "completeness" of some
>>>> >>> features, and using pd.get_dummies can be problematic when you fit a
>>>> >>> model versus predicting if you don't have the same columns/features.
>>>> >>> I simply combined the train and test data sets (without the Id and
>>>> >>> SalePrice) and ran the get_dummies function over that set.
>>>> >>>
>>>> >>
>>>> >> autoclean_cv loads the train set first and then applies those
>>>> >> categorical/numerical mappings to the test set
>>>> >> https://github.com/rhiever/datacleaner#datacleaner-in-scripts
>>>> >>
>>>> >> When I modify load_house_prices [1] to also load test.csv in order to
>>>> >> autoclean_csv,
>>>> >> I might try assigning the categorical levels according to the ranking
>>>> >> in data_description.txt,
>>>> >> rather than the happenstance ordering in train.csv;
>>>> >> though get_dummies should make that irrelevant.
>>>> >>
>>>> >> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L45
>>>> >>
>>>> >> I should probably also manually specify that 'Id' is the index column
>>>> >> in pd.read_csv (assuming there are no duplicates, which pandas should
>>>> >> check for).
>>>> >>
>>>> >>
>>>> >>> When I needed to fit the model, I just "unraveled" the combined set
>>>> >>> with the train and test parts.
>>>> >>>
>>>> >>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
>>>> >>>                       test.loc[:,'MSSubClass':'SaleCondition']))
>>>> >>>
>>>> >>> combined = pd.get_dummies(combined)
>>>> >>>
>>>> >>> ::: do some feature engineering :::
>>>> >>>
>>>> >>> trainX = combined[:train.shape[0]]
>>>> >>> y = train['SalePrice']
>>>> >>>
>>>> >>> Just so long as you don't do anything to the combined dataframe (like
>>>> >>> sorting), you can slice off each part based on its shape.
>>>> >>>
>>>> >>
>>>> >> http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
>>>> >>
>>>> >>
>>>> >>>
>>>> >>> and when you would be pulling the data to predict the test data, you
>>>> >>> get the other part:
>>>> >>>
>>>> >>> testX = combined[train.shape[0]:]
>>>> >>>
>>>> >>
>>>> >> Why is the concatenation necessary?
>>>> >> - log1p doesn't need the whole column
>>>> >> - get_dummies doesn't need the whole column
>>>> >>
>>>> >
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
>>>> requires the whole column.
>>>>
>>>> (http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)
>>>>
>>>>
>>>>
>>>>
>>>> >
>>>> >>
>>>> >>>
>>>> >>>
>>>> >>> Luke
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>
>>>> > (Trimmed reply-chain (again) because 40Kb limit)
>>>> >
>>>> >
>>>> _______________________________________________
>>>> Omaha Python Users Group mailing list
>>>> Omaha at python.org
>>>> https://mail.python.org/mailman/listinfo/omaha
>>>> http://www.OmahaPython.org
>>>>
>>>
>>>
>>
>


More information about the Omaha mailing list