[omaha] Group Data Science Competition

Wes Turner wes.turner at gmail.com
Wed Dec 21 15:11:23 EST 2016


On Wed, Dec 21, 2016 at 1:41 PM, Wes Turner <wes.turner at gmail.com> wrote:

>
>
> On Wed, Dec 21, 2016 at 1:06 PM, Luke Schollmeyer <luke.schollmeyer at gmail.com> wrote:
>
>> The quick explanation is that rather than dropping outliers, I used
>> numpy's log1p function to help normalize the distribution of the data
>> (for both the sale price and for all features over a certain skewness).
>> I was also struggling with adding more features to the model.
>>
>
> https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html
> - http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
>
>
> https://en.wikipedia.org/wiki/Data_transformation_(statistics)#Common_transformations
>
> https://en.wikipedia.org/wiki/Log-normal_distribution
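>
> For concreteness, a minimal sketch of that approach (the 0.75 threshold
> and the column handling are assumptions, not necessarily Luke's exact
> code):
>
> import numpy as np
> import pandas as pd
> from scipy.stats import skew
>
> train = pd.read_csv('train.csv')
>
> # log1p the target to reduce the right skew of SalePrice
> train['SalePrice'] = np.log1p(train['SalePrice'])
>
> # log1p every numeric feature whose skewness exceeds the threshold
> numeric_cols = train.dtypes[train.dtypes != 'object'].index
> skewness = train[numeric_cols].apply(lambda s: skew(s.dropna()))
> skewed_cols = skewness[skewness > 0.75].index
> train[skewed_cols] = np.log1p(train[skewed_cols])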
>
>
> How did you determine the skewness threshold?
>
> ...
>
> https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Specified_variance:_the_normal_distribution
>
> https://en.wikipedia.org/wiki/Normalization_(statistics)
>
> http://scikit-learn.org/stable/modules/preprocessing.html#normalization
>

- https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network
- https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
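
A minimal sketch of standardizing inputs (as those links discuss), fitting
the scaler on the training split only to avoid leakage; trainX/testX here
refer to the slices from the snippets below:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
trainX_scaled = scaler.fit_transform(trainX)  # learn mean/std from train
testX_scaled = scaler.transform(testX)        # reuse the same parameters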


>
>
>
>
>> The training and test data sets have different "completeness" of some
>> features, and pd.get_dummies can be problematic when fitting a model
>> versus predicting if the two sets don't end up with the same
>> columns/features. I simply combined the train and test data sets
>> (without the Id and SalePrice) and ran get_dummies over that combined set.
>>
>
> autoclean_cv loads the train set first and then applies those
> categorical/numerical mappings to the test set
> https://github.com/rhiever/datacleaner#datacleaner-in-scripts
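>
> Per the datacleaner README, usage is along these lines (file names are
> assumptions):
>
> import pandas as pd
> from datacleaner import autoclean_cv
>
> train = pd.read_csv('train.csv')
> test = pd.read_csv('test.csv')
> train_clean, test_clean = autoclean_cv(train, test)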
>
> When I modify load_house_prices [1] to also load test.csv in order to
> run autoclean_cv,
> I might try assigning the categorical levels according to the ranking in
> data_description.txt,
> rather than the happenstance ordering in train.csv;
> though get_dummies should make that irrelevant.
>
> https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L45
>
> I should probably also manually specify that 'Id' is the index column in
> pd.read_csv (assuming there are no duplicates, which pandas should check
> for).
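>
> i.e. something like this (a sketch; note that pd.read_csv does not
> itself raise on duplicate index values, hence the explicit check):
>
> train = pd.read_csv('train.csv', index_col='Id')
> assert train.index.is_unique, "duplicate Id values in train.csv"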
>
>
>> When I needed to fit the model, I just "unraveled" the combined set into
>> its train and test parts.
>>
>> combined = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
>>                       test.loc[:,'MSSubClass':'SaleCondition']))
>>
>> combined = pd.get_dummies(combined)
>>
>> # ... do some feature engineering ...
>>
>> trainX = combined[:train.shape[0]]
>> y = train['SalePrice']
>>
>> Just as long as you don't do anything to the combined dataframe (like
>> sorting), you can slice off each part based on its shape.
>>
>
> http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
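>
> Given that caveat, taking explicit copies of the slices is one way to be
> safe (a sketch):
>
> trainX = combined.iloc[:train.shape[0]].copy()
> testX = combined.iloc[train.shape[0]:].copy()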
>
>
>>
>> and when you need the data for predicting on the test set, you take the
>> other part:
>>
>> testX = combined[train.shape[0]:]
>>
>
> Why is the concatenation necessary?
> - log1p doesn't need the combined train+test column
> - get_dummies doesn't need the combined train+test column
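>
> (Though, per Luke's note above about mismatched columns, get_dummies on
> each split separately can yield different columns; one alternative to
> concatenation is to align afterwards. A sketch, with hypothetical names:)
>
> train_d = pd.get_dummies(train.loc[:, 'MSSubClass':'SaleCondition'])
> test_d = pd.get_dummies(test.loc[:, 'MSSubClass':'SaleCondition'])
> test_d = test_d.reindex(columns=train_d.columns, fill_value=0)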
>
>
>>
>>
>> Luke
>>
>>
>>
>
(Trimmed the reply chain (again) because of the 40 KB limit)

