[omaha] Group Data Science Competition

Wes Turner wes.turner at gmail.com
Sun Dec 25 22:41:56 EST 2016


On Sun, Dec 25, 2016 at 7:40 PM, Wes Turner <wes.turner at gmail.com> wrote:

>
>
> On Sunday, December 25, 2016, Bob Haffner <bob.haffner at gmail.com> wrote:
>
>> Merry Christmas, everyone!
>>
>>
> Merry Christmas!
>
>>
>>  Still heading down the TPOT path with limited success.  I get varying
>> scores (tpot.score()) but the same result (Kaggle scoring).
>>
>> Any other TPOT users getting inconsistent results?   Specifically with
>> 0.6.7?
>>
>
> There may be variance because of the way TPOT splits X_train into X_train
> and X_test w/ train_size and test_size.
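>
> A minimal sketch (assuming TPOT 0.6.x's scikit-learn-style API, with X
> and y as the cleaned features/target) of pinning random_state so scores
> are reproducible run-to-run:
>
>     from tpot import TPOTRegressor
>     from sklearn.model_selection import train_test_split
>
>     # Fixing random_state pins both the holdout split and TPOT's
>     # genetic search, so tpot.score() should be repeatable.
>     X_train, X_test, y_train, y_test = train_test_split(
>         X, y, train_size=0.75, test_size=0.25, random_state=42)
>     tpot = TPOTRegressor(generations=5, population_size=20,
>                          random_state=42)
>     tpot.fit(X_train, y_train)
>     print(tpot.score(X_test, y_test))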
>
> I rewrote load_house_prices (in data.py) as a class with a better McCabe
> cyclomatic complexity score, and added a concatenation step so that
> X_train and X_test have the same columns.
>
> It probably makes sense to use scikit-learn for data transformation (e.g.
> OneHotEncoder instead of get_dummies).
>
> https://twitter.com/westurner/status/813011289475842048 :
> """
> . at scikit_learn
> Src: https://t.co/biMt6XRt2T
> Docs: https://t.co/Lb5EYRCdI8
> #API:
> .fit_transform(X, y)
> .fit(X_train, y_train)
> .predict(X_test)
> """
>
> I haven't yet run w/ pd.get_dummies and df.to_sparse; that may or may not
> prevent the out-of-memory (OOM) error.
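>
> i.e. this untested sketch (both to_sparse and get_dummies(sparse=True)
> are pandas 0.19-era APIs; whether either avoids the OOM here is the
> open question):
>
>     import pandas as pd
>
>     # One-hot encode, then store the mostly-zero dummies sparsely
>     dummies = pd.get_dummies(combined)
>     sparse_dummies = dummies.to_sparse(fill_value=0)
>     # or skip the dense intermediate entirely:
>     sparse_dummies = pd.get_dummies(combined, sparse=True)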
>
> Looking at https://libraries.io/pypi/xgboost "Dependent Repositories",
> there are a number of scikit-learn-compatible packages for automating
> analysis in addition to TPOT: auto-sklearn and rep.
> auto_ml mentions 12 algorithms for type_of_estimator='regressor'
> (as well as sparse matrices and other parameters).
>
> https://github.com/ClimbsRocks/auto_ml
>
> http://auto-ml.readthedocs.io/en/latest/
>

Here's a (probably overfitted) auto_ml attempt:
https://github.com/westurner/house_prices/blob/7260ada0c10cf371b33973b0d9af6bca860d0008/house_prices/analysis_auto_ml.py

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3958857
..."Your submission scored 9.45422, which is not an improvement of your
best score. "

Setting .train(compute_power=10) errored out after a number of GridSearchCV runs.


>
>
> I should be able to generate column_descriptions from parse_description in
> data.py:
> https://github.com/westurner/house_prices/blob/develop/house_prices/data.py
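>
> Per the auto_ml README, roughly (column names here are hypothetical;
> the real dict would be generated from parse_description):
>
>     from auto_ml import Predictor
>
>     # 'output' marks the target; categorical columns are labeled
>     # so auto_ml encodes them itself
>     column_descriptions = {
>         'SalePrice': 'output',
>         'MSZoning': 'categorical',
>         'Neighborhood': 'categorical',
>     }
>     ml_predictor = Predictor(type_of_estimator='regressor',
>                              column_descriptions=column_descriptions)
>     ml_predictor.train(train_df)
>     predictions = ml_predictor.predict(test_df)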
>
> https://github.com/automl/auto-sklearn looks cool too.
>
> ... http://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
>
> http://tflearn.org
>
>
>>
>>
>> On Fri, Dec 23, 2016 at 8:03 AM, Luke Schollmeyer <
>> luke.schollmeyer at gmail.com> wrote:
>>
>>> Moved the needle a little bit yesterday with a ridge regression attempt
>>> using the same feature engineering I described before.
>>>
>>> Luke
>>>
>>> On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner <bob.haffner at gmail.com>
>>> wrote:
>>>
>>>> Made a TPOT attempt tonight.  I could only use some numeric features,
>>>> though, because including the categoricals would cause my IPython
>>>> kernel to die.
>>>>
>>>> I will try a bigger box this weekend
>>>>
>>>> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha
>>>> <omaha at python.org> wrote:
>>>>
>>>>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner <wes.turner at gmail.com>
>>>>> wrote:
>>>>>
>>>>> >
>>>>> >
>>>>> > On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner <wes.turner at gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> >>
>>>>> >>
>>>>> >> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer <
>>>>> >> luke.schollmeyer at gmail.com> wrote:
>>>>> >>
>>>>> >>> The quick explanation: rather than dropping outliers, I used
>>>>> >>> numpy's log1p function to help normalize the distribution of the
>>>>> >>> data (for both the sale price and for all features over a certain
>>>>> >>> skewness). I was also struggling with adding in more features to
>>>>> >>> the model.
>>>>> >>>
>>>>> >>
>>>>> >> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html
>>>>> >> - http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
>>>>> >>
>>>>> >>
>>>>> >> https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations
>>>>> >>
>>>>> >> https://en.wikipedia.org/wiki/Log-normal_distribution
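>>>>> >>
>>>>> >> A sketch of that approach (the 0.75 cutoff is just a commonly
>>>>> >> used value, not necessarily yours; df is the combined features):
>>>>> >>
>>>>> >>     import numpy as np
>>>>> >>     from scipy.stats import skew
>>>>> >>
>>>>> >>     # log1p the target, and any numeric feature whose
>>>>> >>     # skewness exceeds the cutoff
>>>>> >>     y = np.log1p(train['SalePrice'])
>>>>> >>     numeric_cols = df.dtypes[df.dtypes != 'object'].index
>>>>> >>     skewness = df[numeric_cols].apply(
>>>>> >>         lambda s: skew(s.dropna()))
>>>>> >>     skewed = skewness[skewness > 0.75].index
>>>>> >>     df[skewed] = np.log1p(df[skewed])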
>>>>> >>
>>>>> >>
>>>>> >> How did you determine the skewness threshold?
>>>>> >>
>>>>> >> ...
>>>>> >>
>>>>> >> https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Specified_variance:_the_normal_distribution
>>>>> >>
>>>>> >> https://en.wikipedia.org/wiki/Normalization_(statistics)
>>>>> >>
>>>>> >> http://scikit-learn.org/stable/modules/preprocessing.html#normalization
>>>>> >>
>>>>> >
>>>>> > - https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network
>>>>> > - https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
>>>>> >
>>>>>
>>>>> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn
>>>>>
>>>>>
>>>>> >
>>>>> >
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>> The training and test data sets have different "completeness" of
>>>>> >>> some features, and pd.get_dummies can be problematic when the
>>>>> >>> columns/features you fit a model with differ from the ones you
>>>>> >>> predict with. I simply combined the train and test data sets
>>>>> >>> (without the Id and SalePrice) and ran the get_dummies function
>>>>> >>> over that set.
>>>>> >>>
>>>>> >>
>>>>> >> autoclean_cv loads the train set first and then applies those
>>>>> >> categorical/numerical mappings to the test set
>>>>> >> https://github.com/rhiever/datacleaner#datacleaner-in-scripts
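>>>>> >>
>>>>> >> Per the datacleaner README, roughly:
>>>>> >>
>>>>> >>     from datacleaner import autoclean_cv
>>>>> >>
>>>>> >>     # fit encodings/imputations on train, then apply to test
>>>>> >>     clean_train, clean_test = autoclean_cv(train_df, test_df)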
>>>>> >>
>>>>> >> When I modify load_house_prices [1] to also load test.csv for
>>>>> >> autoclean_cv,
>>>>> >> I might try assigning the categorical levels according to the
>>>>> >> ranking in data_description.txt,
>>>>> >> rather than the happenstance ordering in train.csv;
>>>>> >> though get_dummies should make that irrelevant.
>>>>> >>
>>>>> >> [1] https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L45
>>>>> >>
>>>>> >> I should probably also manually specify that 'Id' is the index
>>>>> >> column in pd.read_csv (assuming there are no duplicates, which
>>>>> >> pandas should check for).
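>>>>> >>
>>>>> >> i.e. something like this sketch (ExterQual's ordering is from
>>>>> >> data_description.txt; other ranked columns would be the same):
>>>>> >>
>>>>> >>     import pandas as pd
>>>>> >>
>>>>> >>     # 'Id' as the index; read_csv won't verify uniqueness
>>>>> >>     # itself, but train.index.is_unique can check
>>>>> >>     train = pd.read_csv('train.csv', index_col='Id')
>>>>> >>
>>>>> >>     # ordered categorical levels from data_description.txt,
>>>>> >>     # instead of the happenstance ordering in train.csv
>>>>> >>     train['ExterQual'] = pd.Categorical(
>>>>> >>         train['ExterQual'],
>>>>> >>         categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'],
>>>>> >>         ordered=True)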
>>>>> >>
>>>>> >>
>>>>> >>> When I needed to fit the model, I just "unraveled" the combined
>>>>> >>> set into the train and test parts.
>>>>> >>>
>>>>> >>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
>>>>> >>>                       test.loc[:,'MSSubClass':'SaleCondition']))
>>>>> >>>
>>>>> >>> combined = pd.get_dummies(combined)
>>>>> >>>
>>>>> >>> ::: do some feature engineering :::
>>>>> >>>
>>>>> >>> trainX = combined[:train.shape[0]]
>>>>> >>> y = train['SalePrice']
>>>>> >>>
>>>>> >>> Just so long as you don't do anything to the combined dataframe
>>>>> >>> (like sorting), you can slice off each part based on its shape.
>>>>> >>>
>>>>> >>
>>>>> >> http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
>>>>> >>
>>>>> >>
>>>>> >>>
>>>>> >>> and when you pull the data to predict on the test set, you get
>>>>> >>> the other part:
>>>>> >>>
>>>>> >>> testX = combined[train.shape[0]:]
>>>>> >>>
>>>>> >>
>>>>> >> Why is the concatenation necessary?
>>>>> >> - log1p doesn't need the whole column
>>>>> >> - get_dummies doesn't need the whole column
>>>>> >>
>>>>> >
>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
>>>>> requires the whole column.
>>>>>
>>>>> (http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)
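>>>>>
>>>>> Though FWIW a scaler can also be fit on the train column only and
>>>>> then reused for test, instead of concatenating (a minimal sketch):
>>>>>
>>>>>     from sklearn.preprocessing import StandardScaler
>>>>>
>>>>>     # .fit() learns mean/std from train; .transform() reuses
>>>>>     # them for test without ever joining the two sets
>>>>>     scaler = StandardScaler()
>>>>>     X_train_scaled = scaler.fit_transform(X_train)
>>>>>     X_test_scaled = scaler.transform(X_test)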
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> >
>>>>> >>
>>>>> >>>
>>>>> >>>
>>>>> >>> Luke
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>
>>>>> > (Trimmed the reply chain (again) because of the 40 KB limit)
>>>>> >
>>>>> >
>>>>> _______________________________________________
>>>>> Omaha Python Users Group mailing list
>>>>> Omaha at python.org
>>>>> https://mail.python.org/mailman/listinfo/omaha
>>>>> http://www.OmahaPython.org
>>>>>
>>>>
>>>>
>>>
>>

