[omaha] Group Data Science Competition

>>> Merry Christmas, everyone!
>> Merry Christmas!
>>> Still heading down the TPOT path with limited success.  I get varying
>>> scores (tpot.score()) with the same result (kaggle scoring)
>>> Any other TPOT users getting inconsistent results?   Specifically with
>>> 0.6.7?
>> There may be variance because of the way TPOT splits X_train into X_train
>> and X_test w/ train_size and test_size.
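A minimal sketch of how a random hold-out split can move the score, assuming that is where the variance comes from. It uses scikit-learn's train_test_split directly rather than TPOT, and the synthetic data, the Ridge model and the seeds are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house-price features (not the Kaggle data).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=2.0, size=200)


def holdout_score(seed):
    """Fit on a 75% split chosen by `seed` and return R^2 on the other 25%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    return Ridge().fit(X_tr, y_tr).score(X_te, y_te)


# Different seeds give different hold-out sets, hence different scores.
scores = [holdout_score(seed) for seed in range(5)]
print(scores)

# Pinning the seed makes the split, and so the score, repeatable.
print(holdout_score(42) == holdout_score(42))
```

If the split is the cause, fixing the seed for the split should make the local score repeatable, though it says nothing about the Kaggle score.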
>> I rewrote load_house_prices as a class w/ a better mccabe cyclomatic
>> complexity score with a concatenation step so that X_train and X_test have
>> the same columns (in data.py)
>> It probably makes sense to use scikit-learn for data transformation (e.g.
>> OneHotEncoder instead of get_dummies).
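A rough sketch of the OneHotEncoder route, assuming a scikit-learn recent enough (0.20 or later) to encode string columns directly. The column name and values are toy data, not the real train.csv:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames; 'Neighborhood' stands in for any categorical column.
train = pd.DataFrame({"Neighborhood": ["NAmes", "OldTown", "NAmes"]})
test = pd.DataFrame({"Neighborhood": ["OldTown", "Veenker"]})  # 'Veenker' is unseen

# Learn the categories from the training set only. With handle_unknown="ignore",
# a category that only appears in test becomes an all-zero row instead of an error.
encoder = OneHotEncoder(handle_unknown="ignore")
train_X = encoder.fit_transform(train).toarray()
test_X = encoder.transform(test).toarray()

print(encoder.categories_)
print(train_X.shape, test_X.shape)  # both have one column per training category
```

Because the encoder is fitted once and reused, train and test end up with the same columns without concatenating them first.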
>> https://twitter.com/westurner/status/813011289475842048 :
>> """
>> . at scikit_learn
>> Src: https://t.co/biMt6XRt2T
>> Docs: https://t.co/Lb5EYRCdI8
>> #API:
>> .fit_transform(X, y)
>> .fit(X_train, y_train)
>> .predict(X_test)
>> """
>> I haven't yet run w/ pd.get_dummies and df.to_sparse; that may or may not
>> prevent the oom error.
>> Looking at https://libraries.io/pypi/xgboost "Dependent Repositories",
>> there are a number of scikit-learn-compatible packages for automating
>> analysis in addition to TPOT: auto-sklearn, rep.
>> auto_ml mentions 12 algos for type_of_estimator='regressor'.
>> (and sparse matrices, and other parameters).
>> https://github.com/ClimbsRocks/auto_ml
>> http://auto-ml.readthedocs.io/en/latest/
> Here's a (probably overfitted) auto_ml attempt:
> https://github.com/westurner/house_prices/blob/7260ada0c10cf371b33973b0d9af6bca860d0008/house_prices/analysis_auto_ml.py
> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3958857
> ..."Your submission scored 9.45422, which is not an improvement of your
> best score. "
> Setting .train(compute_power=10) errored out after a bunch of GridSearchCV.
>> I should be able to generate column_descriptions from parse_description in
>> data.py:
>> https://github.com/westurner/house_prices/blob/develop/house_prices/data.py
>> https://github.com/automl/auto-sklearn looks cool too.
>> ... http://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
>> http://tflearn.org
>>>> Moved the needle a little bit yesterday with a ridge regression attempt
>>>> using the same feature engineering I described before.
>>>> Luke
>>>>> Made a TPOT attempt tonight.  Could only do some numeric features
>>>>> though because including categoricals would cause my ipython kernel to die.
>>>>> I will try a bigger box this weekend
>>>>>>>>> The quick explanation is that, rather than dropping outliers, I used
>>>>>>>>> numpy's log1p function to help normalize the distribution of the data
>>>>>>>>> (for both the sale price and for all features over a certain skewness).
>>>>>>>>> I was also struggling with adding in more features to the model.
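The log1p step described above might look roughly like this. The threshold of 0.75 is an assumption (the thread does not say which cutoff was used), and the rows are made-up values under column names from the competition data:

```python
import numpy as np
import pandas as pd

SKEW_THRESHOLD = 0.75  # assumed cutoff; not stated in the thread

df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 215245, 14260],  # heavy right tail
    "OverallQual": [5, 6, 7, 7, 8, 9],                     # symmetric
    "SalePrice": [208500, 181500, 223500, 140000, 375000, 250000],
})

# Work on float copies of the features so the transformed values fit the dtype.
features = df.drop(columns="SalePrice").astype(float)

# Pick the numeric features whose sample skewness exceeds the cutoff.
skewness = features.skew()
skewed_cols = skewness[skewness > SKEW_THRESHOLD].index

# log1p(x) = log(1 + x), so zero values are handled without special-casing.
transformed = features.copy()
transformed[skewed_cols] = np.log1p(transformed[skewed_cols])

# The target gets the same treatment; expm1 undoes it on the predictions.
y = np.log1p(df["SalePrice"])
print(list(skewed_cols))
print(np.expm1(y).round().tolist())
```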
>>>>>>>> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html
>>>>>>>> - http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
>>>>>>>> https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations
>>>>>>>> https://en.wikipedia.org/wiki/Log-normal_distribution
>>>>>>>> How did you determine the skewness threshold?
>>>>>>>> ...
>>>>>>>> https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Specified_variance:_the_normal_distribution
>>>>>>>> https://en.wikipedia.org/wiki/Normalization_(statistics)
>>>>>>>> http://scikit-learn.org/stable/modules/preprocessing.html#normalization
>>>>>>> - https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network
>>>>>>> - https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
>>>>>> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn
>>>>>>>>> The training and test data sets have different "completeness" of some
>>>>>>>>> features, and using pd.get_dummies can be problematic when you fit a
>>>>>>>>> model versus predicting if you don't have the same columns/features. I
>>>>>>>>> simply combined the train and test data sets (without the Id and
>>>>>>>>> SalePrice) and ran the get_dummies function over that set.
>>>>>>>> autoclean_cv loads the train set first and then applies those
>>>>>>>> categorical/numerical mappings to the test set
>>>>>>>> https://github.com/rhiever/datacleaner#datacleaner-in-scripts
>>>>>>>> When I modify load_house_prices [1] to also load test.csv in order to
>>>>>>>> autoclean_cv, I might try assigning the categorical levels according to
>>>>>>>> the ranking in data_description.txt, rather than the happenstance
>>>>>>>> ordering in train.csv; though get_dummies should make that irrelevant.
>>>>>>>> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L45
>>>>>>>> I should probably also manually specify that 'Id' is the index column
>>>>>>>> in pd.read_csv (assuming there are no duplicates, which pandas should
>>>>>>>> check for).
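A small sketch of reading with 'Id' as the index. As far as I know, read_csv accepts duplicate index values without complaint, so the check here is explicit. The CSV text is a stand-in for train.csv; only the 'Id' column name comes from the thread:

```python
import io

import pandas as pd

# Stand-in for train.csv with three made-up rows.
csv_text = (
    "Id,LotArea,SalePrice\n"
    "1,8450,208500\n"
    "2,9600,181500\n"
    "3,11250,223500\n"
)

train = pd.read_csv(io.StringIO(csv_text), index_col="Id")

# Fail loudly if any Id appears more than once.
if not train.index.is_unique:
    raise ValueError("duplicate Id values found")

print(train.index.tolist())
print(list(train.columns))
```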
>>>>>>>>> When I needed to fit the model, I just "unraveled" the combined set
>>>>>>>>> with the train and test parts.
>>>>>>>>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
>>>>>>>>>                      test.loc[:,'MSSubClass':'SaleCondition']))
>>>>>>>>> combined = pd.get_dummies(combined)
>>>>>>>>> ::: do some feature engineering :::
>>>>>>>>> trainX = combined[:train.shape[0]]
>>>>>>>>> y = train['SalePrice']
>>>>>>>>> Just so long as you don't do anything to the combined dataframe (like
>>>>>>>>> sorting), you can slice off each part based on its shape.
>>>>>>>> http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
>>>>>>>>> and when you would be pulling the data to predict the test data, you
>>>>>>>>> get the other part:
>>>>>>>>> testX = combined[train.shape[0]:]
>>>>>>>> Why is the concatenation necessary?
>>>>>>>> - log1p doesn't need the whole column
>>>>>>>> - get_dummies doesn't need the whole column
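One possible answer to the question above, as a toy sketch: get_dummies only creates columns for the levels it sees, so running it on train and test separately can give different columns. Reindexing the test frame to the training columns is an alternative to concatenating. The 'RoofMatl' values are illustrative rather than taken from the real files:

```python
import pandas as pd

train = pd.DataFrame({"RoofMatl": ["CompShg", "Metal", "CompShg"]})
test = pd.DataFrame({"RoofMatl": ["CompShg", "WdShake"]})  # 'WdShake' is not in train

# Encoded separately, each frame only gets columns for its own levels.
train_d = pd.get_dummies(train, dtype=int)
test_d = pd.get_dummies(test, dtype=int)
print(list(train_d.columns))
print(list(test_d.columns))

# Align test to the training columns: levels missing from test are filled
# with 0, and levels the model never saw in training are dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))
```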
>>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
>>>>>> requires the whole column.
>>>>>> (http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)
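StandardScaler's mean and variance depend on the rows it is fitted on, which may be what "requires the whole column" refers to. A minimal sketch with made-up numbers, comparing a scaler fitted on train alone with one fitted on train and test combined:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data; the test values are deliberately far from train.
train_X = np.array([[1.0], [2.0], [3.0]])
test_X = np.array([[10.0], [20.0]])

# Fit on the training rows only, then reuse the same statistics for test.
scaler = StandardScaler().fit(train_X)
train_scaled = scaler.transform(train_X)
test_scaled = scaler.transform(test_X)

# For comparison: a scaler fitted on train and test stacked together.
combined_scaler = StandardScaler().fit(np.vstack([train_X, test_X]))

print(scaler.mean_, combined_scaler.mean_)  # the learned means differ
```

Fitting on the training rows only is the more common habit, since it keeps test data out of the fitted statistics; either way, the same fitted scaler has to be applied to both sets.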
>>>>>>>>> Luke
>>>>>>> (Trimmed reply-chain (again) because 40Kb limit)
