[Numpy-discussion] Matrix rank default tolerance - is it too low?

Thu Jun 14 22:06:44 EDT 2012

Hi,

I noticed that numpy.linalg.matrix_rank sometimes gives full rank for
matrices that are numerically rank deficient:

If I repeatedly make random matrices, then set the first column to be
equal to the sum of the second and third columns:

def make_deficient():
   X = np.random.normal(size=(40, 10))
   deficient_X = X.copy()
   deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
   return deficient_X

then the current numpy.linalg.matrix_rank algorithm returns full rank
(10) in about 8 percent of cases (see appended script).

I think this is a tolerance problem.  The ``matrix_rank`` algorithm
does this by default:

S = spl.svd(M, compute_uv=False)
tol = S.max() * np.finfo(S.dtype).eps
return np.sum(S > tol)

I guess we'd we want the lowest tolerance that nearly always or always
identifies numerically rank deficient matrices.  I suppose one way of
looking at whether the tolerance is in the right range is to compare
the calculated tolerance (``tol``) to the minimum singular value
(``S.min()``) because S.min() in our case should be very small and
indicate the rank deficiency. The mean value of tol / S.min() for the
current algorithm, across many iterations, is about 2.8.  We might
hope this value would be higher than 1, but not much higher, otherwise
we might be rejecting too many columns.

Our current algorithm for tolerance is the same as the 2-norm of M *
eps.  We're citing Golub and Van Loan for this, but now I look at our
copy (p 261, last para) - they seem to be suggesting using u * |M|
where u = (p 61, section 2.4.2) eps /  2. (see [1]). I think the Golub
and Van Loan suggestion corresponds to:

tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2

This tolerance gives full rank for these rank-deficient matrices in
about 39 percent of cases (tol / S.min() ratio of 1.7)

We see on p 56 (section 2.3.2) that:

m, n = M.shape
1 / sqrt(n) . |M|_{inf} <= |M|_2

So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n).  Setting:

tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)

gives about 0.5 percent error (tol / S.min() of 4.4)

Using the Mathworks threshold [2]:

tol = S.max() * np.finfo(M.dtype).eps * max((m, n))

There are no false negatives (0 percent rank 10), but tol / S.min() is
around 110 - so conservative, in this case.

So - summary - I'm worrying our current threshold is too small,
letting through many rank-deficient matrices without detection.  I may
have misread Golub and Van Loan, but maybe we aren't doing what they
suggest.  Maybe what we could use is either the MATLAB threshold or
something like:

tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)

- so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n) . This
gives 0 percent misses and tol / S.min() of 8.7.

What do y'all think?

Best,

Matthew

[1] http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon
[2] http://www.mathworks.com/help/techdoc/ref/rank.html

Output from script:

Percent undetected current: 9.8, tol / S.min(): 2.762
Percent undetected inf norm: 39.1, tol / S.min(): 1.667
Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734
Percent undetected MATLAB: 0.0, tol / S.min(): 110.477

<script>
import numpy as np
import scipy.linalg as npl

M = 40
N = 10

def make_deficient():
    X = np.random.normal(size=(M, N))
    deficient_X = X.copy()
    if M > N: # Make a column deficient
        deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
    else: # Make a row deficient
        deficient_X[0] = deficient_X[1] + deficient_X[2]
    return deficient_X

matrices = []
ranks = []
ranks_inf = []
ranks_ub_inf = []
ranks_ub2_inf = []
ranks_mlab = []
tols = np.zeros((1000, 6))
for i in range(1000):
    m = make_deficient()
    matrices.append(m)
    # The SVD tolerances
    S = npl.svd(m, compute_uv=False)
    S0 = S.max()
    # u in Golub and Van Loan == numpy eps / 2
    eps = np.finfo(m.dtype).eps
    u = eps / 2
    # Current numpy matrix_rank algorithm
    ranks.append(np.linalg.matrix_rank(m))
    # Which is the same as:
    tol_s0 = S0 * eps
    # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
    # Golub and Van Loan suggestion
    tol_inf = npl.norm(m, np.inf) * u
    ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
    # Upper bound of |X|_{inf}
    tol_ub_inf = tol_s0 * np.sqrt(N) / 2
    ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
    # Times 2 fudge
    tol_ub2_inf = tol_s0 * np.sqrt(N)
    ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
    # MATLAB algorithm
    tol_mlab = tol_s0 * max(m.shape)
    ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
    # Collect tols
    tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min()

rel_tols = tols / tols[:, -1][:, None]

fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
max_rank = min(M, N)
for name, ranks, mrt in zip(
    ('current', 'inf norm', 'upper bound inf norm',
     'upper bound inf norm * 2', 'MATLAB'),
    (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
    rel_tols.mean(axis=0)[:5]):
    pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100
    print fmt % (name, pcnt, mrt)
</script>