[scikit-learn] Heisenbug?

Dan Stromberg dstromberg at grokstream.com
Tue Dec 17 11:08:11 EST 2019


Here are the inputs to _assert_all_finite() on one specific failed run.
They look finite to me:
X:
array([0.6150936 , 0.24652782, 0.8880004 , 0.2016928 , 0.80948585,
       0.10764928, 0.81631166, 0.25909033, 0.9299345 , 0.10186833,
       0.81581795, 0.21659133, 0.8279047 , 0.11432098, 0.7335735 ,
       0.20154186, 0.85112196, 0.17447269, 0.5934462 , 0.3967309 ,
       0.83702815, 0.35380727, 0.75063705, 0.32200715, 0.85112196,
       0.11191818, 0.6814021 , 0.11622761, 0.851942  , 0.1892652 ,
       0.8554932 , 0.17869748], dtype=float32)
allow_nan:
False
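
For anyone who wants to double-check the claim that these values are finite, here is a minimal sketch using only NumPy. The helper name describe_finiteness is hypothetical and this is not scikit-learn's own implementation of _assert_all_finite(); it only counts NaN and infinite entries. The second part illustrates the overflow scenario raised earlier in the thread, assuming a float64 value above the float32 range gets cast down somewhere along the way (the 1e39 value is made up for illustration).

```python
import numpy as np

# The array captured from the failed run, copied from the message above.
X = np.array([0.6150936, 0.24652782, 0.8880004, 0.2016928, 0.80948585,
              0.10764928, 0.81631166, 0.25909033, 0.9299345, 0.10186833,
              0.81581795, 0.21659133, 0.8279047, 0.11432098, 0.7335735,
              0.20154186, 0.85112196, 0.17447269, 0.5934462, 0.3967309,
              0.83702815, 0.35380727, 0.75063705, 0.32200715, 0.85112196,
              0.11191818, 0.6814021, 0.11622761, 0.851942, 0.1892652,
              0.8554932, 0.17869748], dtype=np.float32)


def describe_finiteness(arr):
    """Hypothetical helper: summarize NaN/inf content of an array."""
    arr = np.asarray(arr)
    return {
        "dtype": str(arr.dtype),
        "n_nan": int(np.isnan(arr).sum()),
        "n_inf": int(np.isinf(arr).sum()),
        "all_finite": bool(np.isfinite(arr).all()),
    }


report = describe_finiteness(X)
print(report)

# Illustration of overflow: a float64 value larger than the float32
# maximum (about 3.4e38) becomes inf when cast down to float32.
with np.errstate(over="ignore"):
    overflowed = np.array([1e39], dtype=np.float64).astype(np.float32)
overflow_report = describe_finiteness(overflowed)
print(overflow_report)
```

If the first report shows all_finite as True while the failure still occurs in the full test run, that suggests the array reaching the check in the worker process is not the same data as what was captured here.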

On Tue, Dec 17, 2019 at 7:50 AM Dan Stromberg <dstromberg at grokstream.com>
wrote:

>
> Hi.
>
> Overflow does sound kind of possible.  We're sending semi-random values to
> the test.
>
> I believe our systems are all x86_64, Linux.  Some are Ubuntu 16.04, some
> are Mint 19.2.
>
> I realized on the way to work this morning that I left out some important
> information. I suspect a heisenbug for 3 reasons:
>
> 1) If I try to look at it with print functions, I get a traceback after
> the prints, but no print output.  This happens both when writing to a
> disk-based file and when printing to stdout.
>
> 2) If I try to look at it with pudb (a debugger) via pudb.set_trace(), I
> get a failure to start pudb.
>
> 3) If I create a small test program that sends the same inputs to the
> function in question, the function works fine.
>
> Thanks.
>
> On Mon, Dec 16, 2019 at 11:20 PM Joel Nothman <joel.nothman at gmail.com>
> wrote:
>
>> Hi Dan, this kind of error can come from overflow. Are all of your test
>> systems the same architecture?
>>
>> On Tue., 17 Dec. 2019, 12:03 pm Dan Stromberg, <dstromberg at grokstream.com>
>> wrote:
>>
>>> Hi folks.
>>>
>>> I'm new to Scikit-learn.
>>>
>>> I have a very large Python project that seems to have a heisenbug which
>>> is manifesting in scikit-learn code.
>>>
>>> Short of constructing an SSCCE, are there any magical techniques I
>>> should try for pinning down the precise cause?  Like valgrind or something?
>>>
>>> An SSCCE will most likely be pretty painful: the project has copious
>>> shared, mutable state, and I've already tried a largish test program that
>>> calls into the same code path with the error manifesting 0 times in 100.
>>>
>>> It's quite possible the root cause will turn out to be some other part
>>> of the software stack.
>>>
>>> The traceback from pytest looks like:
>>> sequential/test_training.py:101:
>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>> _ _ _ _ _ _ _ _ _ _ _ _ _
>>> ../rt/classifier/coach.py:146: in train
>>>     **self.classifier_section
>>> ../domain/classifier/factories/classifier_academy.py:115: in
>>> create_classifier
>>>     **kwargs)
>>> ../domain/classifier/factories/imp/xgb_factory.py:164: in create
>>>     clf_random.fit(X_train, y_train)
>>> ../../../../.local/lib/python3.6/site-packages/sklearn/model_selection/_search.py:722:
>>> in fit
>>>     self._run_search(evaluate_candidates)
>>> ../../../../.local/lib/python3.6/site-packages/sklearn/model_selection/_search.py:1515:
>>> in _run_search
>>>     random_state=self.random_state))
>>> ../../../../.local/lib/python3.6/site-packages/sklearn/model_selection/_search.py:711:
>>> in evaluate_candidates
>>>     cv.split(X, y, groups)))
>>> ../../../../.local/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py:996:
>>> in __call__
>>>     self.retrieve()
>>> ../../../../.local/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py:899:
>>> in retrieve
>>>     self._output.extend(job.get(timeout=self.timeout))
>>> ../../../../.local/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py:517:
>>> in wrap_future_result
>>>     return future.result(timeout=timeout)
>>> /usr/lib/python3.6/concurrent/futures/_base.py:425: in result
>>>     return self.__get_result()
>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>> _ _ _ _ _ _ _ _ _ _ _ _ _
>>>
>>> self = <Future at 0x7f15571ec7f0 state=finished raised ValueError>
>>>
>>>     def __get_result(self):
>>>         if self._exception:
>>> >           raise self._exception
>>> E           ValueError: Input contains NaN, infinity or a value too
>>> large for dtype('float32').
>>>
>>> /usr/lib/python3.6/concurrent/futures/_base.py:384: ValueError
>>>
>>>
>>> The above exception is raised about 12 to 14 times in 100 in full-blown
>>> automated testing.
>>>
>>> Thanks for the cool software.
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>
>