[omaha] Group Data Science Competition
Jeremy Doyle
uiab1638 at yahoo.com
Wed Dec 28 02:08:48 EST 2016
Sent from my iPhone
> On Dec 25, 2016, at 9:41 PM, Wes Turner via Omaha <omaha at python.org> wrote:
>
>> On Sun, Dec 25, 2016 at 7:40 PM, Wes Turner <wes.turner at gmail.com> wrote:
>>
>>
>>
>>> On Sunday, December 25, 2016, Bob Haffner <bob.haffner at gmail.com> wrote:
>>>
>>> Merry Christmas, everyone!
>>>
>>>
>> Merry Christmas!
>>
>>>
>>> Still heading down the TPOT path with limited success. I get varying
>>> scores (tpot.score()) but the same result (Kaggle scoring).
>>>
>>> Any other TPOT users getting inconsistent results? Specifically with
>>> 0.6.7?
>>>
>>
>> There may be variance because of the way TPOT splits X_train into X_train
>> and X_test w/ train_size and test_size.
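>>
>> A minimal sketch (hypothetical names) of why an unseeded split gives
>> varying scores from run to run, and why fixing the seed makes them
>> reproducible:

```python
import numpy as np

def split(n, train_frac=0.75, seed=None):
    """Return (train_idx, test_idx) from a shuffled range of n rows."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]

# Unseeded: each call shuffles differently, so scores vary between runs.
# Seeded: identical splits, hence reproducible scores.
a, _ = split(100, seed=42)
b, _ = split(100, seed=42)
```

>> TPOT also takes a random_state argument, which should pin down its own
>> internal sampling the same way.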
>>
>> I rewrote load_house_prices as a class (with a lower McCabe cyclomatic
>> complexity score) and added a concatenation step so that X_train and
>> X_test have the same columns (in data.py)
>>
>> It probably makes sense to use scikit-learn for data transformation (e.g.
>> OneHotEncoder instead of get_dummies).
>>
>> https://twitter.com/westurner/status/813011289475842048 :
>> """
>> .@scikit_learn
>> Src: https://t.co/biMt6XRt2T
>> Docs: https://t.co/Lb5EYRCdI8
>> #API:
>> .fit_transform(X, y)
>> .fit(X_train, y_train)
>> .predict(X_test)
>> """
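>>
>> That estimator/transformer API shape in a minimal numpy-only sketch
>> (a toy hand-rolled scaler standing in for a scikit-learn transformer):

```python
import numpy as np

class MeanScaler:
    """Toy transformer following the scikit-learn API shape:
    fit() learns parameters from training data only;
    transform() reuses them on any data."""
    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return X - self.mean_

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

X_train = np.array([[1.0], [3.0]])
X_test = np.array([[2.0]])

scaler = MeanScaler()
Xt_train = scaler.fit_transform(X_train)  # learn the mean on train only
Xt_test = scaler.transform(X_test)        # apply the same mean to test
```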
>>
>> I haven't yet run w/ pd.get_dummies and df.to_sparse; that may or may not
>> prevent the OOM error.
>>
>> Looking at https://libraries.io/pypi/xgboost "Dependent Repositories",
>> there are a number of scikit-learn-compatible packages for automating
>> analysis in addition to TPOT: auto-sklearn, rep.
>> auto_ml mentions 12 algos for type_of_estimator='regressor'.
>> (and sparse matrices, and other parameters).
>>
>> https://github.com/ClimbsRocks/auto_ml
>>
>> http://auto-ml.readthedocs.io/en/latest/
>>
>
> Here's a (probably overfitted) auto_ml attempt:
> https://github.com/westurner/house_prices/blob/7260ada0c10cf371b33973b0d9af6bca860d0008/house_prices/analysis_auto_ml.py
>
> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3958857
> ..."Your submission scored 9.45422, which is not an improvement of your
> best score. "
>
> Setting .train(compute_power=10) errored out after a bunch of GridSearchCV
> runs.
>
>
>>
>>
>> I should be able to generate column_descriptions from parse_description in
>> data.py:
>> https://github.com/westurner/house_prices/blob/develop/house_prices/data.py
>>
>> https://github.com/automl/auto-sklearn looks cool too.
>>
>> ... http://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
>>
>> http://tflearn.org
>>
>>
>>>
>>>
>>> On Fri, Dec 23, 2016 at 8:03 AM, Luke Schollmeyer <
>>> luke.schollmeyer at gmail.com> wrote:
>>>
>>>> Moved the needle a little bit yesterday with a ridge regression attempt
>>>> using the same feature engineering I described before.
>>>>
>>>> Luke
>>>>
>>>> On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner <bob.haffner at gmail.com>
>>>> wrote:
>>>>
>>>>> Made a TPOT attempt tonight. Could only do some numeric features
>>>>> though, because including categoricals would cause my IPython kernel to
>>>>> die.
>>>>>
>>>>> I will try a bigger box this weekend
>>>>>
>>>>> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha <omaha at python.org
>>>>>> wrote:
>>>>>
>>>>>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner <wes.turner at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner <wes.turner at gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer <
>>>>>>>> luke.schollmeyer at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The quick explanation is rather than dropping outliers, I used
>>>>>>>>> numpy's log1p function to help normalize the distribution of the
>>>>>>>>> data (for both the sale price and for all features over a certain
>>>>>>>>> skewness). I was also struggling with adding in more features to
>>>>>>>>> the model.
>>>>>>>>>
>>>>>>>>
>>>>>>>> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html
>>>>>>>> - http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
>>>>>>>>
>>>>>>>>
>>>>>>>> https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations
>>>>>>>>
>>>>>>>> https://en.wikipedia.org/wiki/Log-normal_distribution
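>>>>>>>>
>>>>>>>> A small sketch of that approach (the 0.75 skewness cutoff here is
>>>>>>>> just an illustrative value, not necessarily the one Luke used):

```python
import numpy as np
import pandas as pd

# Toy frame with one heavily right-skewed column (like SalePrice)
# and one roughly symmetric column.
df = pd.DataFrame({
    "SalePrice": [100_000.0, 120_000.0, 130_000.0, 900_000.0],
    "YearBuilt": [1960.0, 1975.0, 1990.0, 2005.0],
})

SKEW_CUTOFF = 0.75  # illustrative threshold
skewed_cols = [c for c in df.columns if abs(df[c].skew()) > SKEW_CUTOFF]

transformed = df.copy()
transformed[skewed_cols] = np.log1p(transformed[skewed_cols])

# np.expm1 inverts log1p, so predictions can be mapped back to dollars.
restored = np.expm1(transformed[skewed_cols])
```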
>>>>>>>>
>>>>>>>>
>>>>>>>> How did you determine the skewness threshold?
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>> https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Specified_variance:_the_normal_distribution
>>>>>>>>
>>>>>>>> https://en.wikipedia.org/wiki/Normalization_(statistics)
>>>>>>>>
>>>>>>>> http://scikit-learn.org/stable/modules/preprocessing.html#normalization
>>>>>>>>
>>>>>>>
>>>>>>> - https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network
>>>>>>> - https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
>>>>>>>
>>>>>>
>>>>>> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> The training and test data sets have different "completeness" of
>>>>>>>>> some features, and using pd.get_dummies can be problematic when you
>>>>>>>>> fit a model and then predict, if the two sets don't have the same
>>>>>>>>> columns/features. I simply combined the train and test data sets
>>>>>>>>> (without the Id and SalePrice) and ran the get_dummies function
>>>>>>>>> over that set.
>>>>>>>>>
>>>>>>>>
>>>>>>>> autoclean_cv loads the train set first and then applies those
>>>>>>>> categorical/numerical mappings to the test set
>>>>>>>> https://github.com/rhiever/datacleaner#datacleaner-in-scripts
>>>>>>>>
>>>>>>>> When I modify load_house_prices [1] to also load test.csv in order
>>>>>>>> to autoclean_cv,
>>>>>>>> I might try assigning the categorical levels according to the
>>>>>>>> ranking in data_description.txt,
>>>>>>>> rather than the happenstance ordering in train.csv;
>>>>>>>> though get_dummies should make that irrelevant.
>>>>>>>>
>>>>>>>> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L45
>>>>>>>>
>>>>>>>> I should probably also manually specify that 'Id' is the index
>>>>>>>> column in pd.read_csv (assuming there are no duplicates, which
>>>>>>>> pandas should check for).
>>>>>>>>
>>>>>>>>
>>>>>>>>> When I needed to fit the model, I just "unraveled" the combined set
>>>>>>>>> into the train and test parts.
>>>>>>>>>
>>>>>>>>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
>>>>>>>>> test.loc[:,'MSSubClass':'SaleCondition']))
>>>>>>>>>
>>>>>>>>> combined = pd.get_dummies(combined)
>>>>>>>>>
>>>>>>>>> ::: do some feature engineering :::
>>>>>>>>>
>>>>>>>>> trainX = combined[:train.shape[0]]
>>>>>>>>> y = train['SalePrice']
>>>>>>>>>
>>>>>>>>> Just so long as you don't do anything to the combined dataframe
>>>>>>>>> (like sorting), you can slice off each part based on its shape.
>>>>>>>>
>>>>>>>> http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> and when you need to pull the data to predict on the test set, you
>>>>>>>>> get the other part:
>>>>>>>>>
>>>>>>>>> testX = combined[train.shape[0]:]
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why is the concatenation necessary?
>>>>>>>> - log1p doesn't need the whole column
>>>>>>>> - get_dummies doesn't need the whole column
>>>>>>>>
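>>>>>>>>
>>>>>>>> One answer: applied separately, get_dummies can produce different
>>>>>>>> columns for train and test; reindexing is a concat-free fix. A toy
>>>>>>>> sketch (made-up category values):

```python
import pandas as pd

# Hypothetical data: test has a zoning level ("FV") train never saw,
# and train has one ("RM") absent from test.
train = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"]})
test = pd.DataFrame({"MSZoning": ["RL", "FV"]})

X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test)

# The column sets differ, so a model fit on X_train can't score X_test
# directly. Align test to the training columns: levels unseen in test
# become all-zero columns, and test-only levels are dropped.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```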
>>>>>>>
>>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
>>>>>> requires the whole column.
>>>>>>
>>>>>> (http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Luke
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>> (Trimmed reply-chain (again) because 40Kb limit)
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Omaha Python Users Group mailing list
>>>>>> Omaha at python.org
>>>>>> https://mail.python.org/mailman/listinfo/omaha
>>>>>> http://www.OmahaPython.org
>>>>>>
>>>>>
>>>>>
>>>>
>>>