[Numpy-discussion] Matrix rank default tolerance - is it too low?

JB Poline jbpoline at gmail.com
Tue Jun 26 18:41:12 EDT 2012


>On Sat, Jun 16, 2012 at 4:39 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> On Sat, Jun 16, 2012 at 9:03 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
>>> Hi,
>>>
>>> On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>>> On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris
>>>> <charlesr.harris at gmail.com> wrote:
>>>>>
>>>>>
>>>>> On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett <matthew.brett at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I noticed that numpy.linalg.matrix_rank sometimes gives full rank for
>>>>>> matrices that are numerically rank deficient:
>>>>>>
>>>>>> If I repeatedly make random matrices, then set the first column to be
>>>>>> equal to the sum of the second and third columns:
>>>>>>
>>>>>> def make_deficient():
>>>>>>    X = np.random.normal(size=(40, 10))
>>>>>>    deficient_X = X.copy()
>>>>>>    deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>>>>>>    return deficient_X
>>>>>>
>>>>>> then the current numpy.linalg.matrix_rank algorithm returns full rank
>>>>>> (10) in about 8 percent of cases (see appended script).
>>>>>>
>>>>>> I think this is a tolerance problem.  The ``matrix_rank`` algorithm
>>>>>> does this by default:
>>>>>>
>>>>>> S = spl.svd(M, compute_uv=False)
>>>>>> tol = S.max() * np.finfo(S.dtype).eps
>>>>>> return np.sum(S > tol)
>>>>>>
>>>>>> I guess we'd want the lowest tolerance that always, or nearly always,
>>>>>> identifies numerically rank deficient matrices.  I suppose one way of
>>>>>> looking at whether the tolerance is in the right range is to compare
>>>>>> the calculated tolerance (``tol``) to the minimum singular value
>>>>>> (``S.min()``) because S.min() in our case should be very small and
>>>>>> indicate the rank deficiency. The mean value of tol / S.min() for the
>>>>>> current algorithm, across many iterations, is about 2.8.  We might
>>>>>> hope this value would be higher than 1, but not much higher, otherwise
>>>>>> we might be rejecting too many columns.
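>>>>>>
>>>>>> For instance (just a quick illustration, reusing ``make_deficient``
>>>>>> above and numpy imported as np):
>>>>>>
>>>>>> X = make_deficient()
>>>>>> S = np.linalg.svd(X, compute_uv=False)
>>>>>> tol = S.max() * np.finfo(S.dtype).eps  # current default tolerance
>>>>>> print(tol / S.min())  # ratio > 1 means the near-zero value falls below tol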
>>>>>>
>>>>>> Our current algorithm's tolerance is the same as the 2-norm of M *
>>>>>> eps.  We're citing Golub and Van Loan for this, but now that I look at
>>>>>> our copy (p 261, last para), they seem to be suggesting u * |M|_{inf},
>>>>>> where u = eps / 2 (p 61, section 2.4.2; see [1]). I think the Golub
>>>>>> and Van Loan suggestion corresponds to:
>>>>>>
>>>>>> tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
>>>>>>
>>>>>> This tolerance gives full rank for these rank-deficient matrices in
>>>>>> about 39 percent of cases (tol / S.min() ratio of 1.7).
>>>>>>
>>>>>> We see on p 56 (section 2.3.2) that:
>>>>>>
>>>>>> m, n = M.shape
>>>>>> (1 / sqrt(n)) * |M|_{inf} <= |M|_2
>>>>>>
>>>>>> So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n).  Setting:
>>>>>>
>>>>>> tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
>>>>>>
>>>>>> gives about 0.5 percent undetected (tol / S.min() of 4.4).
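>>>>>>
>>>>>> As a quick numerical sanity check of that bound (a sketch, assuming
>>>>>> numpy imported as np):
>>>>>>
>>>>>> A = np.random.normal(size=(40, 10))
>>>>>> two_norm = np.linalg.svd(A, compute_uv=False).max()  # |A|_2
>>>>>> inf_norm = np.linalg.norm(A, np.inf)                 # |A|_{inf}
>>>>>> assert inf_norm <= np.sqrt(A.shape[1]) * two_norm    # |A|_{inf} <= sqrt(n) * |A|_2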
>>>>>>
>>>>>> Using the Mathworks threshold [2]:
>>>>>>
>>>>>> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>>>>>>
>>>>>> There are no false negatives (0 percent rank 10), but tol / S.min() is
>>>>>> around 110 - so conservative, in this case.
>>>>>>
>>>>>> So - summary - I'm worried our current threshold is too small,
>>>>>> letting through many rank-deficient matrices without detection.  I may
>>>>>> have misread Golub and Van Loan, but maybe we aren't doing what they
>>>>>> suggest.  Maybe what we could use is either the MATLAB threshold or
>>>>>> something like:
>>>>>>
>>>>>> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
>>>>>>
>>>>>> - that is, u times twice the upper bound for the inf norm, i.e.
>>>>>> u * 2 * |M|_2 * sqrt(n).  This gives 0 percent misses and a
>>>>>> tol / S.min() of 8.7.
>>>>>>
>>>>>> What do y'all think?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Matthew
>>>>>>
>>>>>> [1]
>>>>>> http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon
>>>>>> [2] http://www.mathworks.com/help/techdoc/ref/rank.html
>>>>>>
>>>>>> Output from script:
>>>>>>
>>>>>> Percent undetected current: 9.8, tol / S.min(): 2.762
>>>>>> Percent undetected inf norm: 39.1, tol / S.min(): 1.667
>>>>>> Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
>>>>>> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734
>>>>>> Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
>>>>>>
>>>>>> <script>
>>>>>> import numpy as np
>>>>>> import scipy.linalg as npl
>>>>>>
>>>>>> M = 40
>>>>>> N = 10
>>>>>>
>>>>>> def make_deficient():
>>>>>>    X = np.random.normal(size=(M, N))
>>>>>>    deficient_X = X.copy()
>>>>>>    if M > N: # Make a column deficient
>>>>>>        deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>>>>>>    else: # Make a row deficient
>>>>>>        deficient_X[0] = deficient_X[1] + deficient_X[2]
>>>>>>    return deficient_X
>>>>>>
>>>>>> matrices = []
>>>>>> ranks = []
>>>>>> ranks_inf = []
>>>>>> ranks_ub_inf = []
>>>>>> ranks_ub2_inf = []
>>>>>> ranks_mlab = []
>>>>>> tols = np.zeros((1000, 6))
>>>>>> for i in range(1000):
>>>>>>    m = make_deficient()
>>>>>>    matrices.append(m)
>>>>>>    # The SVD tolerances
>>>>>>    S = npl.svd(m, compute_uv=False)
>>>>>>    S0 = S.max()
>>>>>>    # u in Golub and Van Loan == numpy eps / 2
>>>>>>    eps = np.finfo(m.dtype).eps
>>>>>>    u = eps / 2
>>>>>>    # Current numpy matrix_rank algorithm
>>>>>>    ranks.append(np.linalg.matrix_rank(m))
>>>>>>    # Which is the same as:
>>>>>>    tol_s0 = S0 * eps
>>>>>>    # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
>>>>>>    # Golub and Van Loan suggestion
>>>>>>    tol_inf = npl.norm(m, np.inf) * u
>>>>>>    ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
>>>>>>    # Upper bound of |X|_{inf}
>>>>>>    tol_ub_inf = tol_s0 * np.sqrt(N) / 2
>>>>>>    ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
>>>>>>    # Times 2 fudge
>>>>>>    tol_ub2_inf = tol_s0 * np.sqrt(N)
>>>>>>    ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
>>>>>>    # MATLAB algorithm
>>>>>>    tol_mlab = tol_s0 * max(m.shape)
>>>>>>    ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
>>>>>>    # Collect tols
>>>>>>    tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min()
>>>>>>
>>>>>> rel_tols = tols / tols[:, -1][:, None]
>>>>>>
>>>>>> fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
>>>>>> max_rank = min(M, N)
>>>>>> for name, ranks, mrt in zip(
>>>>>>    ('current', 'inf norm', 'upper bound inf norm',
>>>>>>     'upper bound inf norm * 2', 'MATLAB'),
>>>>>>    (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
>>>>>>    rel_tols.mean(axis=0)[:5]):
>>>>>>    pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100
>>>>>>    print fmt % (name, pcnt, mrt)
>>>>>> </script>
>>>>>
>>>>>
>>>>> The polynomial fitting uses eps times the largest array dimension for the
>>>>> relative condition number. IIRC, that choice traces back to Numerical
>>>>> Recipes.
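>>>>>
>>>>> Roughly, from memory (a sketch of the idea, not the actual polyfit
>>>>> source):
>>>>>
>>>>> import numpy as np
>>>>> x = np.linspace(0, 1, 50)
>>>>> y = 1 + 2 * x
>>>>> A = np.vander(x, 2)                       # design matrix, degree 1 fit
>>>>> rcond = len(x) * np.finfo(x.dtype).eps    # eps * largest dimension
>>>>> coef, res, rank, sv = np.linalg.lstsq(A, y, rcond=rcond)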
>>>
>>> Chuck - sorry - I didn't understand what you were saying, and now I
>>> think you were proposing the MATLAB algorithm.   I can't find that in
>>> Numerical Recipes - can you?  It would be helpful as a reference.
>>>
>>>> This is the same as Matlab, right?
>>>
>>> Yes, I believe so, i.e:
>>>
>>> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>>>
>>> from my original email.
>>>
>>>> If the Matlab condition is the most conservative, then it seems like a
>>>> reasonable choice -- conservative is good so long as your false
>>>> positive rate doesn't become too high, and presumably Matlab has enough
>>>> user experience to know whether the false positive rate is too high.
>>>
>>> Are we agreeing to go for the Matlab algorithm?
>>>
>>> If so, how should this be managed?  Just changing it may change the
>>> output of code using numpy >= 1.5.0, but then again, the threshold is
>>> probably incorrect.
>>>
>>> Fix and break, or FutureWarning with something like:
>>>
>>> def matrix_rank(M, tol=None):
>>>
>>> where ``tol`` can be a string like ``maxdim``?
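>>>
>>> I.e. something along these lines (just a sketch to show the idea; the
>>> 'maxdim' name and the warning text are made up, and it ignores the
>>> scalar / 1-D special cases in the real function):
>>>
>>> import warnings
>>> import numpy as np
>>>
>>> def matrix_rank(M, tol=None):
>>>     M = np.asarray(M)
>>>     S = np.linalg.svd(M, compute_uv=False)
>>>     if tol is None:
>>>         warnings.warn("default tol may change to eps * S.max() * max(M.shape)",
>>>                       FutureWarning)
>>>         tol = S.max() * np.finfo(S.dtype).eps
>>>     elif tol == 'maxdim':
>>>         tol = S.max() * np.finfo(S.dtype).eps * max(M.shape)
>>>     return np.sum(S > tol)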
>>
>> I dunno, I don't think we should do a big deprecation dance for every
>> bug fix. Is this a bug fix, so numpy will simply start producing more
>> accurate results on a given problem? I guess there isn't really a
>> right answer here (though claiming that [a, b, a+b] is full-rank is
>> clearly broken, and the matlab algorithm seems reasonable for
>> answering the specific question of whether a matrix is full rank), so
>> we'll have to hope some users speak up...
>
>I don't see a problem changing this as a bugfix.
>statsmodels still has, I think, the original scipy.stats.models
>version of rank, which uses cond=1.0e-12 - still much higher than an
>eps-based tolerance for any non-huge float array.
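>
>If I remember right, that one is essentially (for a 2-D float array X,
>with numpy as np):
>
>    S = np.linalg.svd(X, compute_uv=False)
>    rank = int(np.sum(S > 1.0e-12 * S.max()))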
>
>Josef

+1 for making the default "matlab": it sounds like it would be the least
confusing. It also seems to me that a bug fix is probably the right procedure.
Last, I like best having only the matlab default (options seem unnecessary).
cheers
JB


