[Numpy-discussion] Fwd: Multi-distribution Linux wheels - please test

Matthew Brett matthew.brett at gmail.com
Tue Feb 9 14:40:17 EST 2016


On Tue, Feb 9, 2016 at 11:37 AM, Julian Taylor
<jtaylor.debian at googlemail.com> wrote:
> On 09.02.2016 04:59, Nathaniel Smith wrote:
>> On Mon, Feb 8, 2016 at 6:07 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>> On Mon, Feb 8, 2016 at 6:04 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
>>>> On Mon, Feb 8, 2016 at 5:26 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>>>> On Mon, Feb 8, 2016 at 4:37 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
>>>>> [...]
>>>>>> I can't replicate the segfault with manylinux wheels and scipy.  On
>>>>>> the other hand, I get a new test error with numpy from manylinux and
>>>>>> scipy from manylinux, like this:
>>>>>>
>>>>>> $ python -c 'import scipy.linalg; scipy.linalg.test()'
>>>>>>
>>>>>> ======================================================================
>>>>>> FAIL: test_decomp.test_eigh('general ', 6, 'F', True, False, False, (2, 4))
>>>>>> ----------------------------------------------------------------------
>>>>>> Traceback (most recent call last):
>>>>>>   File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line
>>>>>> 197, in runTest
>>>>>>     self.test(*self.arg)
>>>>>>   File "/usr/local/lib/python2.7/dist-packages/scipy/linalg/tests/test_decomp.py",
>>>>>> line 658, in eigenhproblem_general
>>>>>>     assert_array_almost_equal(diag2_, ones(diag2_.shape[0]), DIGITS[dtype])
>>>>>>   File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py",
>>>>>> line 892, in assert_array_almost_equal
>>>>>>     precision=decimal)
>>>>>>   File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py",
>>>>>> line 713, in assert_array_compare
>>>>>>     raise AssertionError(msg)
>>>>>> AssertionError:
>>>>>> Arrays are not almost equal to 4 decimals
>>>>>>
>>>>>> (mismatch 100.0%)
>>>>>>  x: array([ 0.,  0.,  0.], dtype=float32)
>>>>>>  y: array([ 1.,  1.,  1.])
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>> Ran 1507 tests in 14.928s
>>>>>>
>>>>>> FAILED (KNOWNFAIL=4, SKIP=1, failures=1)
>>>>>>
>>>>>> This is a very odd error, which we don't get when running against a
>>>>>> numpy installed from source and linked to ATLAS, and which doesn't
>>>>>> happen when running the tests via:
>>>>>>
>>>>>> nosetests /usr/local/lib/python2.7/dist-packages/scipy/linalg
>>>>>>
>>>>>> So, something about the copy of numpy (linked to openblas) is
>>>>>> affecting the results of scipy (also linked to openblas), and only
>>>>>> with a particular environment / test order.
>>>>>>
>>>>>> If you'd like to try and see whether y'all can do a better job of
>>>>>> debugging than me:
>>>>>>
>>>>>> # Run this script inside a docker container started with this incantation:
>>>>>> # docker run -ti --rm ubuntu:12.04 /bin/bash
>>>>>> apt-get update
>>>>>> apt-get install -y python curl
>>>>>> # the next line won't be necessary with the next iteration of the
>>>>>> # manylinux wheel builds
>>>>>> apt-get install -y libpython2.7
>>>>>> curl -LO https://bootstrap.pypa.io/get-pip.py
>>>>>> python get-pip.py
>>>>>> pip install -f https://nipy.bic.berkeley.edu/manylinux numpy scipy nose
>>>>>> python -c 'import scipy.linalg; scipy.linalg.test()'
>>>>>
>>>>> I just tried this and on my laptop it completed without error.
>>>>>
>>>>> Best guess is that we're dealing with some memory corruption bug
>>>>> inside openblas, so it's getting perturbed by things like exactly what
>>>>> other calls to openblas have happened (which is different depending on
>>>>> whether numpy is linked to openblas), and which core type openblas has
>>>>> detected.
>>>>>
>>>>> On my laptop, which *doesn't* show the problem, running with
>>>>> OPENBLAS_VERBOSE=2 says "Core: Haswell".
>>>>>
>>>>> Guess the next step is checking what core type the failing machines
>>>>> use, and running valgrind... anyone have a good valgrind suppressions
>>>>> file?
>>>>
>>>> My machine (which does give the failure) gives
>>>>
>>>> Core: Core2
>>>>
>>>> with OPENBLAS_VERBOSE=2
>>>
>>> Yep, that allows me to reproduce it:
>>>
>>> root at f7153f0cc841:/# OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=Core2 python -c 'import scipy.linalg; scipy.linalg.test()'
>>> Core: Core2
>>> [...]
>>> ======================================================================
>>> FAIL: test_decomp.test_eigh('general ', 6, 'F', True, False, False, (2, 4))
>>> ----------------------------------------------------------------------
>>> [...]
>>>
>>> So this is indeed sounding like an OpenBLAS issue... next stop
>>> valgrind, I guess :-/
>>
>> Here's the valgrind output:
>>   https://gist.github.com/njsmith/577d028e79f0a80d2797
>>
>> There's a lot of it, but no smoking guns have jumped out at me :-/
>>
>> -n
>>
>
> plenty of smoking guns, e.g.:
>
> ==3695== Invalid read of size 8
> ==3695==    at 0x7AAA9C0: daxpy_k_CORE2 (in /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
> ==3695==    by 0x76BEEFC: ger_kernel (in /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
> ==3695==    by 0x788F618: exec_blas (in /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
> ==3695==    by 0x76BF099: dger_thread (in /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
> ==3695==    by 0x767DC37: dger_ (in /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
>
>
> I think I have already reported that to openblas; they said they do it
> intentionally, though last I checked they were missing the code that
> verifies this is actually allowed (if you're not crossing a page you can
> read beyond the boundaries).  It's pretty likely a pointless
> micro-optimization; you normally only use that trick for string functions,
> where you don't know the size of the string.
>
> Your valgrind output also indicates it ran on Core2, while the issues occur
> on Sandy Bridge; maybe valgrind messes with the CPU detection, so it won't
> show anything.

Julian - thanks for having a look.  Do you happen to remember the
openblas issue number for this?

Is there an obvious place where we could patch openblas to avoid this
particular error?
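
To make that question concrete, here is a minimal sketch of the trick Julian
describes plus the page check that would justify it.  This is a hypothetical
illustration, in no way OpenBLAS's actual code; PAGE_SIZE,
overread_is_on_same_page and sum_tail are made-up names for the example.

/* Hypothetical illustration only; none of this is OpenBLAS's actual code. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* Nonzero if reading `extra` bytes past the end of a `len`-byte buffer
 * touches no page that the buffer itself does not already touch. */
static int overread_is_on_same_page(const void *buf, size_t len, size_t extra)
{
    uintptr_t last_valid = (uintptr_t)buf + len - 1;
    uintptr_t last_read  = (uintptr_t)buf + len + extra - 1;
    return (last_valid / PAGE_SIZE) == (last_read / PAGE_SIZE);
}

/* Tail of a vectorized sum: with 0 < n < 4 elements left, the "fast"
 * path loads a full 4-float word (16 bytes), deliberately reading past
 * the end of the array; that read is what valgrind reports.  The page
 * check is the missing guard Julian mentions: the over-read cannot
 * fault as long as the extra bytes stay on the same page as the last
 * valid element, because that page is already mapped. */
static float sum_tail(const float *x, size_t n)
{
    float acc = 0.0f;
    if (n > 0 && n < 4 &&
        overread_is_on_same_page(x, n * sizeof(float),
                                 (4 - n) * sizeof(float))) {
        float tmp[4];
        for (size_t i = 0; i < 4; i++)   /* stand-in for one SIMD load;  */
            tmp[i] = x[i];               /* reads x[n..3] out of bounds  */
        for (size_t i = 0; i < n; i++)
            acc += tmp[i];               /* the extra lanes are ignored  */
    } else {
        for (size_t i = 0; i < n; i++)   /* plain scalar fallback        */
            acc += x[i];
    }
    return acc;
}

Note that even with a check like this, valgrind still sees a read past the
end of the allocation, so the daxpy_k_CORE2 / ger_kernel frames would
presumably still need a suppressions entry to keep the output readable.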

Cheers,

Matthew


