[scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

Andrew Howe ahowe42 at gmail.com
Tue Dec 27 02:18:42 EST 2016


Hi Debu

"Should I be using 2 different input datasets (completely exclusive /
disjoint) for training and scoring the models ?"  Yes - this is the reason
for partitioning the data into training / testing sets.  However, I can't
imagine that it's the cause of your odd results.  What is the total
classification result in both training & testing (not just TPs)?
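
For example, something along the following lines would show the complete
classification result on both partitions, not just the true positives. This
is only a rough sketch - a small synthetic dataset stands in for your data,
so substitute your own X, y and fitted classifier:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Full picture on both the training and the test partition
for name, X_part, y_part in [("train", X_train, y_train),
                             ("test", X_test, y_test)]:
    y_pred = clf.predict(X_part)
    print("---", name, "---")
    print(confusion_matrix(y_part, y_pred))       # rows: actual, cols: predicted
    print(classification_report(y_part, y_pred))  # precision / recall / F1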

Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
www.andrewhowe.com
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>

On Tue, Dec 27, 2016 at 8:26 AM, Debabrata Ghosh <mailfordebu at gmail.com>
wrote:

> Hi Joel,
>
>                 Thanks for your quick feedback – I certainly understand
> what you mean. Please allow me to explain one more time through the
> sequence of steps I followed:
>
>
>
>    1. I considered a dataset containing 600 K (0.6 million) records for
>    training my model using scikit-learn's Random Forest Classifier.
>
>    2. I did a training and test sample split on the 600 K, forming a 480 K
>    training dataset and a 120 K test dataset (80:20 split).
>
>    3. I trained scikit-learn's Random Forest Classifier model on the 480 K
>    (80% split) training sample.
>
>    4. I then ran prediction (the predict_proba method of scikit-learn's RF
>    classifier) on the 120 K test sample.
>
>    5. I got a prediction result with a True Positive Rate (TPR) of 10-12%
>    at probability thresholds above 0.5.
>
>    6. I saved the above Random Forest Classifier model using scikit-learn's
>    joblib library (dump method) in the form of a pickle file.
>
>    7. I reloaded the model in a different Python instance from the pickle
>    file mentioned above and did my scoring, i.e., used the joblib load
>    method and then ran prediction (the predict_proba method of the reloaded
>    model) on the entire set of my original 600 K records.
>
>    8. Now, when I run (score) my model by calling predict_proba on the
>    reloaded model over the entire set of original data (600 K), I get a
>    True Positive Rate of around 80%.
>
>    9. I did some further analysis and figured out that during the training
>    process, when the model predicted on the test sample of 120 K, it could
>    only predict 10-12% of the 120 K data beyond a probability threshold of
>    0.5. When I now try to score my model on the entire set of 600 K
>    records, it appears that the model is remembering some of its past
>    behavior and data and is accordingly reporting an 80% True Positive
>    Rate.
>
>    10. When I tried to score the model using predict_proba on a completely
>    disjoint dataset from the one used for training (i.e., no overlap
>    between training and scoring data), it gives me the right True Positive
>    Rate (in the range of 10-12%). A short sketch reproducing this overlap
>    effect follows this list.
>
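> To make the above steps concrete, here is a minimal sketch of the workflow
> I described (a small synthetic dataset stands in for my 600 K records, and
> the file name is illustrative):
>
> import joblib
> from sklearn.datasets import make_classification
> from sklearn.ensemble import RandomForestClassifier
> from sklearn.model_selection import train_test_split
>
> # Stand-in for the 600 K records (steps 1-2: the 80:20 split)
> X, y = make_classification(n_samples=6000, weights=[0.9, 0.1], random_state=42)
> X_train, X_test, y_train, y_test = train_test_split(
>     X, y, test_size=0.2, random_state=42)
>
> # Step 3: train the Random Forest on the 80% training sample
> clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
>
> # Steps 4-5: predict_proba on the held-out 20% gives the honest estimate
> proba_test = clf.predict_proba(X_test)[:, 1]
>
> # Steps 6-7: persist with joblib and reload in another instance
> joblib.dump(clf, 'rf_model.pkl')
> clf_loaded = joblib.load('rf_model.pkl')
>
> # Step 8: scoring the full dataset - 80% of these rows were in the
> # training sample, so the model has already seen them and the measured
> # rate comes out inflated
> proba_all = clf_loaded.predict_proba(X)[:, 1]
>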
>           *Here lies my question once again:* Should I be using 2
> different input datasets (completely exclusive / disjoint) for training
> and scoring the models? If the input datasets for scoring and training
> overlap, then I get incorrect results. Will that be a fair assumption?
>
>           Another question – is there an alternative model scoring library
> (apart from joblib, the one I am using)?
>
>
>          Thanks once again in advance for your feedback!
>
>
> Cheers,
>
>
> Debu
>
> On Tue, Dec 27, 2016 at 1:56 AM, Joel Nothman <joel.nothman at gmail.com>
> wrote:
>
>> Hi Debu,
>>
>> Your post is terminologically confusing, so I'm not sure I've understood
>> your problem. Where is the "different sample" used for scoring coming from?
>> Is it possible it is more related to the training data than the test sample?
>>
>> Joel
>>
>> On 27 December 2016 at 05:28, Debabrata Ghosh <mailfordebu at gmail.com>
>> wrote:
>>
>>> Dear All,
>>>
>>>                                 Greetings!
>>>
>>>                                 I need some urgent guidance and help
>>> from you all on model scoring. What I mean by model scoring is the
>>> following sequence of steps:
>>>
>>>
>>>
>>>    1. I have trained a Random Forest Classifier model using scikit-learn
>>>    (the RandomForestClassifier estimator)
>>>    2. Then I have generated the True Positive and False Positive
>>>    predictions on my test data set using the predict_proba method (I have
>>>    split my data into training and test samples in an 80:20 ratio); the
>>>    calculation is sketched after this list
>>>    3. Finally, I have dumped the model into a .pkl file
>>>    4. Next, in another instance, I have loaded the .pkl file
>>>    5. I have called the reloaded model's predict_proba method to predict
>>>    the True Positives and False Positives on a different sample. I am
>>>    terming this step "scoring" because I am predicting without retraining
>>>    the model
>>>
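>>> Roughly, the threshold-based calculation in step 2 looks like the sketch
>>> below. This is only illustrative - y_test and proba_test are small
>>> placeholders standing in for my actual test labels and predict_proba
>>> output:
>>>
>>> import numpy as np
>>> from sklearn.metrics import confusion_matrix
>>>
>>> # Placeholders; in my case proba_test = clf.predict_proba(X_test)[:, 1]
>>> y_test = np.array([0, 0, 1, 1, 0, 1])
>>> proba_test = np.array([0.2, 0.7, 0.6, 0.4, 0.1, 0.9])
>>>
>>> y_pred = (proba_test >= 0.5).astype(int)        # apply the 0.5 threshold
>>> tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
>>> tpr = tp / (tp + fn)                            # True Positive Rate
>>> fpr = fp / (fp + tn)                            # False Positive Rate
>>> print("TPR = {:.1%}, FPR = {:.1%}".format(tpr, fpr))
>>>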
>>>                 My question is: when I generate the True Positive Rate
>>> on the test data set (as part of the model training approach), the rate I
>>> get is 10 – 12%. But when I do the scoring (using the steps mentioned
>>> above), my True Positive Rate shoots up to 80%. Although I am happy to
>>> get a very high TPR, my question is whether getting such a high TPR
>>> during the scoring phase is an expected outcome. In other words, is
>>> achieving a high TPR on the joblib-reloaded model an accepted outcome
>>> vis-à-vis the TPR obtained on the training / test data set?
>>>
>>>                 Your views on the above will be really helpful, as I am
>>> very confused about whether to score the model using joblib. Otherwise,
>>> is there any alternative to joblib which can help me do the scoring
>>> without retraining the model? Please let me know at your earliest
>>> convenience, as I am a bit pressed for time.
>>>
>>>
>>>
>>> Thanks for your help in advance!
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Debu
>>>
>>
>