[omaha] Group Data Science Competition

Luke Schollmeyer luke.schollmeyer at gmail.com
Fri Dec 23 09:03:00 EST 2016


Moved the needle a little bit yesterday with a ridge regression attempt
using the same feature engineering I described before.
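
Roughly, the model side looks like this (the alpha, file handling, and
quick feature matrix here are illustrative, not the exact pipeline I ran):

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

train = pd.read_csv('train.csv')

# quick-and-dirty feature matrix; the real attempt used the feature
# engineering described earlier in the thread
X = pd.get_dummies(train.drop(['Id', 'SalePrice'], axis=1)).fillna(0)
y = np.log1p(train['SalePrice'])  # log-transformed target

model = Ridge(alpha=10.0)
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)
print('CV RMSE (log scale): %.4f' % np.sqrt(-scores).mean())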

Luke

On Thu, Dec 22, 2016 at 8:47 PM, Bob Haffner <bob.haffner at gmail.com> wrote:

> Made a TPOT attempt tonight.  Could only use some numeric features,
> though, because including the categoricals would cause my IPython kernel
> to die.
>
> I will try a bigger box this weekend
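>
> The numeric-only attempt was shaped roughly like this (the parameters
> are illustrative, not the exact run):
>
> from tpot import TPOTRegressor
> import pandas as pd
>
> train = pd.read_csv('train.csv')
>
> # numeric columns only; the categoricals were what crashed the kernel
> numeric = train.select_dtypes(include=['number']).fillna(0)
> X = numeric.drop(['Id', 'SalePrice'], axis=1)
> y = train['SalePrice']
>
> tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
> tpot.fit(X, y)
> tpot.export('tpot_house_prices_pipeline.py')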
>
> On Wed, Dec 21, 2016 at 2:14 PM, Wes Turner via Omaha <omaha at python.org>
> wrote:
>
>> On Wed, Dec 21, 2016 at 2:11 PM, Wes Turner <wes.turner at gmail.com> wrote:
>>
>> >
>> >
>> > On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner <wes.turner at gmail.com>
>> wrote:
>> >
>> >>
>> >>
>> >> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer <
>> >> luke.schollmeyer at gmail.com> wrote:
>> >>
>> >>> The quick explanation is that, rather than dropping outliers, I used
>> >>> numpy's log1p function to help normalize the distribution of the data
>> >>> (for both the sale price and for all features over a certain skewness).
>> >>> I was also struggling with adding more features to the model.
>> >>>
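>> >>> Roughly (the 0.75 skew cutoff below is just a commonly used value,
>> >>> not necessarily the one I settled on; "combined" is the stacked
>> >>> train+test frame described further down):
>> >>>
>> >>> import numpy as np
>> >>> from scipy.stats import skew
>> >>>
>> >>> # normalize the target
>> >>> y = np.log1p(train['SalePrice'])
>> >>>
>> >>> # log1p any numeric feature whose skewness exceeds the cutoff
>> >>> numeric_feats = combined.dtypes[combined.dtypes != 'object'].index
>> >>> skewed = combined[numeric_feats].apply(lambda s: skew(s.dropna()))
>> >>> skewed = skewed[skewed > 0.75].index
>> >>> combined[skewed] = np.log1p(combined[skewed])
>> >>>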
>> >>
>> >> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html
>> >> - http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
>> >>
>> >>
>> >> https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations
>> >>
>> >> https://en.wikipedia.org/wiki/Log-normal_distribution
>> >>
>> >>
>> >> How did you determine the skewness threshold?
>> >>
>> >> ...
>> >>
>> >> https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Specified_variance:_the_normal_distribution
>> >>
>> >> https://en.wikipedia.org/wiki/Normalization_(statistics)
>> >>
>> >> http://scikit-learn.org/stable/modules/preprocessing.html#normalization
>> >>
>> >
>> > - https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network
>> > - https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
>> >
>>
>> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn
>>
>>
>> >
>> >
>> >>
>> >>
>> >>
>> >>
>> >>> The training and test data sets have different "completeness" for some
>> >>> features, and pd.get_dummies can be problematic when the columns/features
>> >>> you fit a model on don't match the ones you predict with. I simply
>> >>> combined the train and test data sets (without the Id and SalePrice
>> >>> columns) and ran the get_dummies function over that combined set.
>> >>>
>> >>
>> >> autoclean_cv loads the train set first and then applies those
>> >> categorical/numerical mappings to the test set
>> >> https://github.com/rhiever/datacleaner#datacleaner-in-scripts
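>> >>
>> >> Usage is roughly (per the datacleaner README; file names assumed):
>> >>
>> >> import pandas as pd
>> >> from datacleaner import autoclean_cv
>> >>
>> >> train = pd.read_csv('train.csv')
>> >> test = pd.read_csv('test.csv')
>> >>
>> >> # fits encoders/imputers on the training set, then applies the same
>> >> # mappings to the test set so the two stay consistent
>> >> clean_train, clean_test = autoclean_cv(train, test)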
>> >>
>> >> When I modify load_house_prices [1] to also load test.csv in order to
>> >> run autoclean_cv, I might try assigning the categorical levels
>> >> according to the ranking in data_description.txt, rather than the
>> >> happenstance ordering in train.csv; though get_dummies should make
>> >> that irrelevant.
>> >>
>> >> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L45
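>> >>
>> >> For the ordered levels, something like this (ExterQual's levels are
>> >> from data_description.txt; the rest is illustrative):
>> >>
>> >> import pandas as pd
>> >>
>> >> # order the levels per data_description.txt rather than the order
>> >> # they happen to appear in train.csv
>> >> quality_levels = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
>> >> train['ExterQual'] = pd.Categorical(train['ExterQual'],
>> >>                                     categories=quality_levels,
>> >>                                     ordered=True)
>> >> codes = train['ExterQual'].cat.codes  # ordinal encoding, if wanted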
>> >>
>> >> I should probably also manually specify that 'Id' is the index column
>> >> in pd.read_csv (assuming there are no duplicates, which pandas should
>> >> check for).
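>> >>
>> >> i.e. (with an explicit uniqueness check, since read_csv itself won't
>> >> complain about duplicate index values):
>> >>
>> >> import pandas as pd
>> >>
>> >> train = pd.read_csv('train.csv', index_col='Id')
>> >> assert train.index.is_unique  # fail fast on duplicate Ids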
>> >>
>> >>
>> >>> When I needed to fit the model, I just "unraveled" the combined set
>> >>> back into the train and test parts:
>> >>>
>> >>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
>> >>>                       test.loc[:,'MSSubClass':'SaleCondition']))
>> >>>
>> >>> combined = pd.get_dummies(combined)
>> >>>
>> >>> ::: do some feature engineering :::
>> >>>
>> >>> trainX = combined[:train.shape[0]]
>> >>> y = train['SalePrice']
>> >>>
>> >>> As long as you don't do anything to the combined dataframe (like
>> >>> sorting), you can slice each part back off based on its shape.
>> >>>
>> >>
>> >> http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
>> >>
>> >>
>> >>>
>> >>> and when you need to pull the data to predict on the test set, you
>> >>> take the other part:
>> >>>
>> >>> testX = combined[train.shape[0]:]
>> >>>
>> >>
>> >> Why is the concatenation necessary?
>> >> - log1p doesn't need the whole column
>> >> - get_dummies doesn't need the whole column (see the align sketch below)
>> >>
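>> >> Applying get_dummies per-frame works if you reconcile the dummy
>> >> columns afterwards; a minimal sketch (train_raw/test_raw are
>> >> illustrative names):
>> >>
>> >> # align the two dummy-encoded frames on columns, filling levels
>> >> # missing from either side with 0
>> >> train_d = pd.get_dummies(train_raw)
>> >> test_d = pd.get_dummies(test_raw)
>> >> train_d, test_d = train_d.align(test_d, join='left', axis=1, fill_value=0)
>> >>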
>> >
>> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
>> requires the whole column.
>>
>> (http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)
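>>
>> Though the leak-free pattern is to fit the scaler on train only and
>> reuse its statistics on test, e.g.:
>>
>> from sklearn.preprocessing import StandardScaler
>>
>> scaler = StandardScaler()
>> trainX_scaled = scaler.fit_transform(trainX)  # fit on train only
>> testX_scaled = scaler.transform(testX)        # reuse train's mean/scale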
>>
>>
>>
>>
>> >
>> >>
>> >>>
>> >>>
>> >>> Luke
>> >>>
>> >>>
>> >>>
>> >>
>> > (Trimmed reply-chain (again) because of the 40 KB limit)
>> >
>> >
>> _______________________________________________
>> Omaha Python Users Group mailing list
>> Omaha at python.org
>> https://mail.python.org/mailman/listinfo/omaha
>> http://www.OmahaPython.org
>>
>
>

