[omaha] Group Data Science Competition
Jeremy Doyle
uiab1638 at yahoo.com
Wed Dec 28 02:08:48 EST 2016
Sent from my iPhone
> On Dec 25, 2016, at 9:41 PM, Wes Turner via Omaha <omaha at python.org> wrote:
>
>> On Sun, Dec 25, 2016 at 7:40 PM, Wes Turner <wes.turner at gmail.com> wrote:
>>
>>
>>
>>> On Sunday, December 25, 2016, Bob Haffner <bob.haffner at gmail.com> wrote:
>>>
>>> Merry Christmas, everyone!
>>>
>>>
>> Merry Christmas!
>>
>>>
>>> Still heading down the TPOT path with limited success. I get varying
>>> scores (tpot.score()) but the same result (Kaggle scoring).
>>>
>>> Any other TPOT users getting inconsistent results? Specifically with
>>> 0.6.7?
>>>
>>
>> There may be variance because of the way TPOT splits X_train into X_train
>> and X_test w/ train_size and test_size.
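>>
>> A minimal sketch (hypothetical names) of why an unseeded split gives
>> varying scores from run to run, and why fixing the seed makes them
>> reproducible:

```python
import numpy as np

def split(n, train_frac=0.75, seed=None):
    """Return (train_idx, test_idx) from a shuffled range of n rows."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]

# Unseeded: each call shuffles differently, so scores vary between runs.
# Seeded: identical splits, hence reproducible scores.
a, _ = split(100, seed=42)
b, _ = split(100, seed=42)
```

>> TPOT also takes a random_state argument, which should pin down its own
>> internal sampling the same way.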
>>
>> I rewrote load_house_prices as a class (with a lower McCabe cyclomatic
>> complexity score) and added a concatenation step so that X_train and
>> X_test have the same columns (in data.py)
>>
>> It probably makes sense to use scikit-learn for data transformation (e.g.
>> OneHotEncoder instead of get_dummies).
>>
>> https://twitter.com/westurner/status/813011289475842048 :
>> """
>> .@scikit_learn
>> Src: https://t.co/biMt6XRt2T
>> Docs: https://t.co/Lb5EYRCdI8
>> #API:
>> .fit_transform(X, y)
>> .fit(X_train, y_train)
>> .predict(X_test)
>> """
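>>
>> That estimator/transformer API shape in a minimal numpy-only sketch
>> (a toy hand-rolled scaler standing in for a scikit-learn transformer):

```python
import numpy as np

class MeanScaler:
    """Toy transformer following the scikit-learn API shape:
    fit() learns parameters from training data only;
    transform() reuses them on any data."""
    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return X - self.mean_

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

X_train = np.array([[1.0], [3.0]])
X_test = np.array([[2.0]])

scaler = MeanScaler()
Xt_train = scaler.fit_transform(X_train)  # learn the mean on train only
Xt_test = scaler.transform(X_test)        # apply the same mean to test
```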
>>
>> I haven't yet run w/ pd.get_dummies and df.to_sparse; that may or may not
>> prevent the OOM error.
>>
>> Looking at https://libraries.io/pypi/xgboost "Dependent Repositories",
>> there are a number of scikit-learn-compatible packages for automating
>> analysis in addition to TPOT: auto-sklearn, rep.
>> auto_ml mentions 12 algos for type_of_estimator='regressor'.
>> (and sparse matrices, and other parameters).
>>
>> https://github.com/ClimbsRocks/auto_ml
>>
>> http://auto-ml.readthedocs.io/en/latest/
>>
>
> Here's a (probably overfitted) auto_ml attempt:
> https://github.com/westurner/house_prices/blob/7260ada0c10cf371b33973b0d9af6bca860d0008/house_prices/analysis_auto_ml.py
>
> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3958857
> ..."Your submission scored 9.45422, which is not an improvement of your
> best score. "
>
> Setting .train(compute_power=10) errored out after a bunch of GridSearchCV
> runs.
>
>
>>
>>
>> I should be able to generate column_descriptions from parse_description in
>> data.py:
>> https://github.com/westurner/house_prices/blob/develop/house_prices/data.py
>>
>> https://github.com/automl/auto-sklearn looks cool too.
>>
>> ... http://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
>>
>> http://tflearn.org
>>
>>
>>>
>>>
>>> On Fri, Dec 23, 2016 at 8:03 AM, Luke Schollmeyer <
>>> luke.schollmeyer at gmail.com> wrote:
>>>
>>>> Moved the needle a little bit yesterday with a ridge regression attempt
>>>> using the same feature engineering I described before.
>>>>
>>>> Luke
>>>>
>>>> On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner <bob.haffner at gmail.com>
>>>> wrote:
>>>>
>>>>> Made a TPOT attempt tonight. Could only do some numeric features
>>>>> though, because including categoricals would cause my IPython kernel to
>>>>> die.
>>>>>
>>>>> I will try a bigger box this weekend
>>>>>
>>>>> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha <omaha at python.org
>>>>>> wrote:
>>>>>
>>>>>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner <wes.turner at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner <wes.turner at gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer <
>>>>>>>> luke.schollmeyer at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The quick explanation is rather than dropping outliers, I used
>>>>>>>>> numpy's log1p function to help normalize the distribution of the
>>>>>>>>> data (for both the sale price and for all features over a certain
>>>>>>>>> skewness). I was also struggling with adding in more features to
>>>>>>>>> the model.
>>>>>>>>>
>>>>>>>>
>>>>>>>> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html
>>>>>>>> - http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
>>>>>>>>
>>>>>>>>
>>>>>>>> https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations
>>>>>>>>
>>>>>>>> https://en.wikipedia.org/wiki/Log-normal_distribution
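>>>>>>>>
>>>>>>>> A small sketch of that approach (the 0.75 skewness cutoff here is
>>>>>>>> just an illustrative value, not necessarily the one Luke used):

```python
import numpy as np
import pandas as pd

# Toy frame with one heavily right-skewed column (like SalePrice)
# and one roughly symmetric column.
df = pd.DataFrame({
    "SalePrice": [100_000.0, 120_000.0, 130_000.0, 900_000.0],
    "YearBuilt": [1960.0, 1975.0, 1990.0, 2005.0],
})

SKEW_CUTOFF = 0.75  # illustrative threshold
skewed_cols = [c for c in df.columns if abs(df[c].skew()) > SKEW_CUTOFF]

transformed = df.copy()
transformed[skewed_cols] = np.log1p(transformed[skewed_cols])

# np.expm1 inverts log1p, so predictions can be mapped back to dollars.
restored = np.expm1(transformed[skewed_cols])
```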
>>>>>>>>
>>>>>>>>
>>>>>>>> How did you determine the skewness threshold?
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>> https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Specified_variance:_the_normal_distribution
>>>>>>>>
>>>>>>>> https://en.wikipedia.org/wiki/Normalization_(statistics)
>>>>>>>>
>>>>>>>> http://scikit-learn.org/stable/modules/preprocessing.html#normalization
>>>>>>>>
>>>>>>>
>>>>>>> - https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network
>>>>>>> - https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
>>>>>>>
>>>>>>
>>>>>> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> The training and test data sets have different "completeness" of
>>>>>>>>> some features, and using pd.get_dummies can be problematic when you
>>>>>>>>> fit a model and then predict, if the two sets don't have the same
>>>>>>>>> columns/features. I simply combined the train and test data sets
>>>>>>>>> (without the Id and SalePrice) and ran the get_dummies function
>>>>>>>>> over that set.
>>>>>>>>>
>>>>>>>>
>>>>>>>> autoclean_cv loads the train set first and then applies those
>>>>>>>> categorical/numerical mappings to the test set
>>>>>>>> https://github.com/rhiever/datacleaner#datacleaner-in-scripts
>>>>>>>>
>>>>>>>> When I modify load_house_prices [1] to also load test.csv in order
>>>>>>>> to autoclean_cv,
>>>>>>>> I might try assigning the categorical levels according to the
>>>>>>>> ranking in data_description.txt,
>>>>>>>> rather than the happenstance ordering in train.csv;
>>>>>>>> though get_dummies should make that irrelevant.
>>>>>>>>
>>>>>>>> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L45
>>>>>>>>
>>>>>>>> I should probably also manually specify that 'Id' is the index
>>>>>>>> column in pd.read_csv (assuming there are no duplicates, which
>>>>>>>> pandas should check for).
>>>>>>>>
>>>>>>>>
>>>>>>>>> When I needed to fit the model, I just "unraveled" the combined set
>>>>>>>>> into the train and test parts.
>>>>>>>>>
>>>>>>>>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
>>>>>>>>> test.loc[:,'MSSubClass':'SaleCondition']))
>>>>>>>>>
>>>>>>>>> combined = pd.get_dummies(combined)
>>>>>>>>>
>>>>>>>>> ::: do some feature engineering :::
>>>>>>>>>
>>>>>>>>> trainX = combined[:train.shape[0]]
>>>>>>>>> y = train['SalePrice']
>>>>>>>>>
>>>>>>>>> Just so long as you don't do anything to the combined dataframe
>>>>>>>>> (like sorting), you can slice off each part based on its shape.
>>>>>>>>
>>>>>>>> http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> and when you need to pull the data to predict on the test set, you
>>>>>>>>> get the other part:
>>>>>>>>>
>>>>>>>>> testX = combined[train.shape[0]:]
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why is the concatenation necessary?
>>>>>>>> - log1p doesn't need the whole column
>>>>>>>> - get_dummies doesn't need the whole column
>>>>>>>>
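>>>>>>>>
>>>>>>>> One answer: applied separately, get_dummies can produce different
>>>>>>>> columns for train and test; reindexing is a concat-free fix. A toy
>>>>>>>> sketch (made-up category values):

```python
import pandas as pd

# Hypothetical data: test has a zoning level ("FV") train never saw,
# and train has one ("RM") absent from test.
train = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"]})
test = pd.DataFrame({"MSZoning": ["RL", "FV"]})

X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test)

# The column sets differ, so a model fit on X_train can't score X_test
# directly. Align test to the training columns: levels unseen in test
# become all-zero columns, and test-only levels are dropped.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```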
>>>>>>>
>>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
>>>>>> requires the whole column.
>>>>>>
>>>>>> (http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Luke
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>> (Trimmed reply-chain (again) because 40Kb limit)
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Omaha Python Users Group mailing list
>>>>>> Omaha at python.org
>>>>>> https://mail.python.org/mailman/listinfo/omaha
>>>>>> http://www.OmahaPython.org
>>>>>>
>>>>>
>>>>>
>>>>
>>>