[scikit-learn] Scikit learn GridSearchCV fit method ValueError Found array with 0 sample

Mon Jul 11 07:16:05 EDT 2016

Hi Maciek,

Thanks for the suggestion. I think the problem is indeed related to
StratifiedKFold, because if I use KFold instead the code works fine.
However, if I print the StratifiedKFold object it looks fine to me:

sklearn.cross_validation.StratifiedKFold(labels=[ 5.43  8.74  8.1
6.55  7.66  6.52  8.6   7.1   6.4   8.05  7.89  6.68
  8.06  6.17  5.5   7.96  5.78  6.    7.74  5.83  6.51  6.31  6.68  9.22
  6.07  7.06  7.12  8.64  5.72  6.4   7.64  5.74  7.41  6.49  6.81  7.1
  7.66  6.68  7.05  6.28  5.49  6.35  6.9   6.2   7.51  5.65  9.3   5.84
  6.92  5.75  6.92  8.8   7.04  5.81  5.73  5.31  7.13  7.66  6.98  5.93
  8.24  6.96  8.22  7.27  7.34  5.91  5.57  6.5   7.28  6.74  4.92  6.88
  5.8   9.15  6.63  6.37  8.66  6.4 ], n_folds=5, shuffle=False,
random_state=None)
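
One way to answer the question about class counts is to tally the labels
printed above. The sketch below is only an illustration and is not part of
the notebook: it assumes the 78 values shown are the complete y_traina, and
the names labels, n_folds, class_counts and too_small are introduced here
purely for the example. StratifiedKFold treats every distinct label value as
its own class, so continuous targets like these produce many classes with
very few members each.

```python
from collections import Counter

# Labels copied from the StratifiedKFold printout above
# (assumed to be the whole of y_traina).
labels = [
    5.43, 8.74, 8.1, 6.55, 7.66, 6.52, 8.6, 7.1, 6.4, 8.05, 7.89, 6.68,
    8.06, 6.17, 5.5, 7.96, 5.78, 6.0, 7.74, 5.83, 6.51, 6.31, 6.68, 9.22,
    6.07, 7.06, 7.12, 8.64, 5.72, 6.4, 7.64, 5.74, 7.41, 6.49, 6.81, 7.1,
    7.66, 6.68, 7.05, 6.28, 5.49, 6.35, 6.9, 6.2, 7.51, 5.65, 9.3, 5.84,
    6.92, 5.75, 6.92, 8.8, 7.04, 5.81, 5.73, 5.31, 7.13, 7.66, 6.98, 5.93,
    8.24, 6.96, 8.22, 7.27, 7.34, 5.91, 5.57, 6.5, 7.28, 6.74, 4.92, 6.88,
    5.8, 9.15, 6.63, 6.37, 8.66, 6.4,
]
n_folds = 5  # same value as in the StratifiedKFold call

# Stratification is done per distinct label value, so count each value.
class_counts = Counter(labels)

# A class with fewer members than folds cannot be represented in every fold.
too_small = {
    label: count for label, count in class_counts.items() if count < n_folds
}

print("samples:", len(labels))
print("distinct label values:", len(class_counts))
print("largest class size:", max(class_counts.values()))
print("classes smaller than n_folds:", len(too_small))
```

In this printout no value occurs as many as five times, so no class can be
spread over all five folds, which may be why some splits end up empty. These
look like regression targets rather than class labels, which would also fit
with plain KFold working fine here.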

On Fri, Jul 8, 2016 at 10:42 PM, Maciek Wójcikowski
<maciek at wojcikowski.pl> wrote:
> Hi Michał,
>
> What are the class counts in that set? Maybe there is a problem with
> generating stratified subsamples (e.g. some classes get fewer than 1 sample)?
>
> ----
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> maciek at wojcikowski.pl
>
> 2016-07-08 17:22 GMT+02:00 Michał Nowotka <mmmnow at gmail.com>:
>>
>> Hi,
>>
>> Sorry for cross-posting
>>
>> (http://stackoverflow.com/questions/38263933/scikit-learn-gridsearchcv-fit-method-valueerror-found-array-with-0-sample)
>> but I don't know where it is better to get help with my problem.
>> I'm working on a VM with Jupyter notebook server installed.
>> From time to time I add new notebooks and reevaluate old ones to see
>> if they still work.
>>
>> This notebook stopped working because of changes in the scikit-learn
>> API that made some parameters obsolete:
>>
>>
>> https://github.com/chembl/mychembl/blob/master/ipython_notebooks/10_myChEMBL_machine_learning.ipynb
>>
>> I've created a corrected version of the notebook here:
>>
>> https://gist.github.com/anonymous/676c55cc501ffa48fecfcc1e1252d433
>>
>> But I'm stuck in cell 36 on this code:
>>
>> from sklearn import cross_validation
>> from sklearn.cross_validation import KFold, StratifiedKFold
>> from sklearn.grid_search import GridSearchCV
>>
>> # x, y and custom_forest are assumed to be defined in earlier notebook cells
>> X_traina, X_testa, y_traina, y_testa = cross_validation.train_test_split(
>>     x, y, test_size=0.95, random_state=23)
>>
>> params = {'min_samples_split': [8], 'max_depth': [20],
>>           'min_samples_leaf': [1], 'n_estimators': [200]}
>> cv = KFold(n=len(X_traina), n_folds=10, shuffle=True)
>> cv_stratified = StratifiedKFold(y_traina, n_folds=5)
>> gs = GridSearchCV(custom_forest, params, cv=cv_stratified,
>>                   verbose=1, refit=True)
>> gs.fit(X_traina, y_traina)
>>
>> This gives me:
>>
>> ValueError: Found array with 0 sample(s) (shape=(0, 491)) while a
>> minimum of 1 is required.
>>
>> I don't understand this, because when I print the shapes of the samples:
>>
>> print (X_traina.shape, X_testa.shape, y_traina.shape, y_testa.shape)
>>
>> I'm getting:
>>
>> ((78, 491), (1489, 491), (78,), (1489,))
>>
>> Interestingly, if I change the test_size parameter to 0.88 (as in
>> the corrected example notebook) it works, and this is the highest value
>> for which it works. For this value, the shapes are:
>>
>> ((188, 491), (1379, 491), (188,), (1379,))
>>
>> So the question is: what should I change in my code to make it work
>> with test_size set to 0.95 as well?
>>
>> Kind regards,
>>
>> Michal Nowotka
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn