[Numpy-discussion] Matrix rank default tolerance - is it too low?

Matthew Brett matthew.brett at gmail.com
Tue Jun 26 17:42:47 EDT 2012


Hi,

On Mon, Jun 18, 2012 at 3:50 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
> Hi,
>
> On Sun, Jun 17, 2012 at 7:22 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
>>
>>
>> On Sat, Jun 16, 2012 at 2:33 PM, Matthew Brett <matthew.brett at gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> On Sat, Jun 16, 2012 at 8:03 PM, Matthew Brett <matthew.brett at gmail.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>> >> On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris
>>> >> <charlesr.harris at gmail.com> wrote:
>>> >>>
>>> >>>
>>> >>> On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett
>>> >>> <matthew.brett at gmail.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> I noticed that numpy.linalg.matrix_rank sometimes gives full rank for
>>> >>>> matrices that are numerically rank deficient:
>>> >>>>
>>> >>>> If I repeatedly make random matrices, then set the first column to be
>>> >>>> equal to the sum of the second and third columns:
>>> >>>>
>>> >>>> def make_deficient():
>>> >>>>    X = np.random.normal(size=(40, 10))
>>> >>>>    deficient_X = X.copy()
>>> >>>>    deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>>> >>>>    return deficient_X
>>> >>>>
>>> >>>> then the current numpy.linalg.matrix_rank algorithm returns full rank
>>> >>>> (10) in about 8 percent of cases (see appended script).
>>> >>>>
>>> >>>> I think this is a tolerance problem.  The ``matrix_rank`` algorithm
>>> >>>> does this by default:
>>> >>>>
>>> >>>> S = spl.svd(M, compute_uv=False)
>>> >>>> tol = S.max() * np.finfo(S.dtype).eps
>>> >>>> return np.sum(S > tol)
>>> >>>>
>>> >>>> I guess we'd want the lowest tolerance that nearly always or always
>>> >>>> identifies numerically rank deficient matrices.  I suppose one way of
>>> >>>> looking at whether the tolerance is in the right range is to compare
>>> >>>> the calculated tolerance (``tol``) to the minimum singular value
>>> >>>> (``S.min()``) because S.min() in our case should be very small and
>>> >>>> indicate the rank deficiency. The mean value of tol / S.min() for the
>>> >>>> current algorithm, across many iterations, is about 2.8.  We might
>>> >>>> hope this value would be higher than 1, but not much higher,
>>> >>>> otherwise
>>> >>>> we might be rejecting too many columns.
>>> >>>>
>>> >>>> Our current algorithm for tolerance is the same as the 2-norm of M *
>>> >>>> eps.  We're citing Golub and Van Loan for this, but now I look at our
>>> >>>> copy (p 261, last para) - they seem to be suggesting using u * |M|,
>>> >>>> where u (p 61, section 2.4.2) is eps / 2 (see [1]). I think the Golub
>>> >>>> and Van Loan suggestion corresponds to:
>>> >>>>
>>> >>>> tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
>>> >>>>
>>> >>>> This tolerance gives full rank for these rank-deficient matrices in
>>> >>>> about 39 percent of cases (tol / S.min() ratio of 1.7)
>>> >>>>
>>> >>>> We see on p 56 (section 2.3.2) that, for m, n = M.shape:
>>> >>>>
>>> >>>> (1 / sqrt(n)) * |M|_{inf} <= |M|_2
>>> >>>>
>>> >>>> so we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n).
>>> >>>> Setting:
>>> >>>>
>>> >>>> tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
>>> >>>>
>>> >>>> gives about 0.5 percent error (tol / S.min() of 4.4)
>>> >>>>
>>> >>>> Using the Mathworks threshold [2]:
>>> >>>>
>>> >>>> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>>> >>>>
>>> >>>> There are no false negatives (0 percent rank 10), but tol / S.min()
>>> >>>> is around 110 - so conservative, in this case.
>>> >>>>
>>> >>>> So - summary - I'm worried our current threshold is too small,
>>> >>>> letting through many rank-deficient matrices without detection.  I
>>> >>>> may have misread Golub and Van Loan, but maybe we aren't doing what
>>> >>>> they suggest.  Maybe what we could use is either the MATLAB threshold
>>> >>>> or something like:
>>> >>>>
>>> >>>> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
>>> >>>>
>>> >>>> - that is, u times twice the upper bound for the inf norm, 2 * |M|_2
>>> >>>> * sqrt(n).  This gives 0 percent misses and tol / S.min() of 8.7.
>>> >>>>
>>> >>>> What do y'all think?
>>> >>>>
>>> >>>> Best,
>>> >>>>
>>> >>>> Matthew
>>> >>>>
>>> >>>> [1]
>>> >>>>
>>> >>>> http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon
>>> >>>> [2] http://www.mathworks.com/help/techdoc/ref/rank.html
>>> >>>>
>>> >>>> Output from script:
>>> >>>>
>>> >>>> Percent undetected current: 9.8, tol / S.min(): 2.762
>>> >>>> Percent undetected inf norm: 39.1, tol / S.min(): 1.667
>>> >>>> Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
>>> >>>> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734
>>> >>>> Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
>>> >>>>
>>> >>>> <script>
>>> >>>> import numpy as np
>>> >>>> import scipy.linalg as spl
>>> >>>>
>>> >>>> M = 40
>>> >>>> N = 10
>>> >>>>
>>> >>>> def make_deficient():
>>> >>>>    X = np.random.normal(size=(M, N))
>>> >>>>    deficient_X = X.copy()
>>> >>>>    if M > N: # Make a column deficient
>>> >>>>        deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>>> >>>>    else: # Make a row deficient
>>> >>>>        deficient_X[0] = deficient_X[1] + deficient_X[2]
>>> >>>>    return deficient_X
>>> >>>>
>>> >>>> matrices = []
>>> >>>> ranks = []
>>> >>>> ranks_inf = []
>>> >>>> ranks_ub_inf = []
>>> >>>> ranks_ub2_inf = []
>>> >>>> ranks_mlab = []
>>> >>>> tols = np.zeros((1000, 6))
>>> >>>> for i in range(1000):
>>> >>>>    m = make_deficient()
>>> >>>>    matrices.append(m)
>>> >>>>    # The SVD tolerances
>>> >>>>    S = spl.svd(m, compute_uv=False)
>>> >>>>    S0 = S.max()
>>> >>>>    # u in Golub and Van Loan == numpy eps / 2
>>> >>>>    eps = np.finfo(m.dtype).eps
>>> >>>>    u = eps / 2
>>> >>>>    # Current numpy matrix_rank algorithm
>>> >>>>    ranks.append(np.linalg.matrix_rank(m))
>>> >>>>    # Which is the same as:
>>> >>>>    tol_s0 = S0 * eps
>>> >>>>    # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
>>> >>>>    # Golub and Van Loan suggestion
>>> >>>>    tol_inf = spl.norm(m, np.inf) * u
>>> >>>>    ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
>>> >>>>    # Upper bound of |X|_{inf}
>>> >>>>    tol_ub_inf = tol_s0 * np.sqrt(N) / 2
>>> >>>>    ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
>>> >>>>    # Times 2 fudge
>>> >>>>    tol_ub2_inf = tol_s0 * np.sqrt(N)
>>> >>>>    ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
>>> >>>>    # MATLAB algorithm
>>> >>>>    tol_mlab = tol_s0 * max(m.shape)
>>> >>>>    ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
>>> >>>>    # Collect tols
>>> >>>>    tols[i] = (tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab,
>>> >>>>               S.min())
>>> >>>>
>>> >>>> rel_tols = tols / tols[:, -1][:, None]
>>> >>>>
>>> >>>> fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
>>> >>>> max_rank = min(M, N)
>>> >>>> for name, rank_list, mrt in zip(
>>> >>>>    ('current', 'inf norm', 'upper bound inf norm',
>>> >>>>     'upper bound inf norm * 2', 'MATLAB'),
>>> >>>>    (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
>>> >>>>    rel_tols.mean(axis=0)[:5]):
>>> >>>>    pcnt = np.sum(np.array(rank_list) == max_rank) / 1000. * 100
>>> >>>>    print(fmt % (name, pcnt, mrt))
>>> >>>> </script>
>>> >>>
>>> >>>
>>> >>> The polynomial fitting uses eps times the largest array dimension for
>>> >>> the relative condition number.  IIRC, that choice traces back to
>>> >>> numerical recipes.
>>> >
>>> > Chuck - sorry - I didn't understand what you were saying, and now I
>>> > think you were proposing the MATLAB algorithm.   I can't find that in
>>> > Numerical Recipes - can you?  It would be helpful as a reference.
>>> >
>>> >> This is the same as Matlab, right?
>>> >
>>> > Yes, I believe so, i.e:
>>> >
>>> > tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>>> >
>>> > from my original email.
>>> >
>>> >> If the Matlab condition is the most conservative, then it seems like a
>>> >> reasonable choice -- conservative is good so long as your false
>>> >> positive rate doesn't become too high, and presumably Matlab has enough
>>> >> user experience to know whether the false positive rate is too high.
>>> >
>>> > Are we agreeing to go for the Matlab algorithm?
>>>
>>> As extra data, current Numerical Recipes (2007, p 67) appears to prefer:
>>>
>>> tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)
>>
>>
>> That's interesting, as something like that with a square root was my first
>> choice for the least squares, but then someone mentioned the NR choice. That
>> was all on the mailing list several years back when I was fixing up the
>> polynomial fitting routine. The NR reference is on page 517 of the 1986
>> edition (FORTRAN), which might be hard to come by these days ;)
>
> Thanks for tracking that down, it's very helpful.
>
> For those of you not near a huge University library or your own
> private copy, p517 says:
>
> "A plausible answer to the question "how small is small", is to edit
> in this fashion all singular values whose ratio to the largest
> singular value is less then N times the machine precision \epsilon.
> (You might argue for root N, or a constant, instead of N as the
> multiple; that starts getting into hardware-dependent questions).
>
> Earlier (p510) we see the (General Linear Least Squares) problem being
> set up as A = (N x M) where N >= M.
>
> The 2007 edition replaces the "(You might argue... )" text with: (p 795)
>
> "(This is a more conservative recommendation than the default in
> section 2.6 which scales as N^{1/2})"
>
> and this in turn refers to the threshold:
>
> tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)
>
> (p67) - which is justified as being (p71) " ... a default value based
> on expected roundoff error".
>
> I could not at first glance see any other justification for this
> threshold in the text.
>
> So, how about something like:
>
> def matrix_rank(M, tol='maxdim'):
> ...
>
>    tol: {'maxdim', 'nr-roundoff'} or float
>         If str, gives the threshold strategy for tolerance relative
> to the maximum singular value, explained below.  If float, gives the
> absolute tolerance below which singular values are assumed zero.  For
> the threshold strategies, call the maximum singular value ``S.max()``
> and the floating point epsilon for the working precision data type
> ``eps``.  The default strategy is 'maxdim', corresponding to ``tol =
> S.max() * eps * max(M.shape)``.  This is the MATLAB default; see also
> Numerical Recipes 2007.  The other option is 'nr-roundoff' (also from
> Numerical Recipes 2007), corresponding to ``tol = S.max() * eps / 2 *
> np.sqrt(M.shape[0] + M.shape[1] + 1)``.
>
> ?
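
Spelled out, the proposal above would look roughly like the sketch
below (illustrative only - the strategy names are just the ones
suggested in the quoted docstring, and the 0-d / 1-d special cases of
the real function are ignored):

import numpy as np

def matrix_rank(M, tol='maxdim'):
    # Sketch of the two-strategy proposal; 2-D input assumed.
    S = np.linalg.svd(M, compute_uv=False)
    eps = np.finfo(S.dtype).eps
    if tol == 'maxdim':
        # MATLAB default; see also Numerical Recipes 2007
        tol = S.max() * eps * max(M.shape)
    elif tol == 'nr-roundoff':
        # Numerical Recipes 2007 expected-roundoff default (p 67)
        tol = S.max() * eps / 2. * np.sqrt(M.shape[0] + M.shape[1] + 1.)
    # otherwise tol is used as an absolute threshold
    return np.sum(S > tol)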

Thinking about this more, I'm tempted to just go for the matlab / old
NR solution, that is:

 tol = S.max() * np.finfo(M.dtype).eps * max((m, n))

and not offer any other options.  My reasoning is that the matrix_rank
code is basically trivial; it is a convenience function.  If someone
wants a better matrix rank, it's reasonable to copy these few lines
into a new function.  So providing multiple options for tolerance
seems like overkill.
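
For scale, with that default the whole convenience function is
essentially just the 'maxdim' branch of the sketch above - something
like (again a sketch, 2-D input assumed, not the exact patch):

import numpy as np

def matrix_rank(M, tol=None):
    S = np.linalg.svd(M, compute_uv=False)
    if tol is None:
        # matlab / old-NR default: eps scaled by the largest dimension
        tol = S.max() * np.finfo(S.dtype).eps * max(M.shape)
    return np.sum(S > tol)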

Then the question is whether to use NR-2007:

tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)

or the matlab default above.  I'm leaning toward the matlab version
because we can be more or less sure that it does not miss actual
numerical rank deficiency, and it might be what people are expecting
if they shift from matlab or compare to matlab.  The NR-2007 threshold
might well be closer to an accurate threshold for numerical rank
deficiency, but I haven't tested it across all variants and platforms,
and the reduced false positives seem a minor gain compared to the
slight increase in the risk of false negatives and the difference from
matlab.
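
For anyone who wants to eyeball the difference, a small snippet along
these lines (same matrix construction as in the script above) prints
the two candidate tolerances next to the smallest singular value:

import numpy as np

def make_deficient(rows=40, cols=10):
    # Column 0 is the sum of columns 1 and 2, as in the script above.
    X = np.random.normal(size=(rows, cols))
    X[:, 0] = X[:, 1] + X[:, 2]
    return X

m = make_deficient()
S = np.linalg.svd(m, compute_uv=False)
eps = np.finfo(m.dtype).eps
tol_matlab = S.max() * eps * max(m.shape)                     # matlab / old NR
tol_nr2007 = S.max() * eps / 2. * np.sqrt(sum(m.shape) + 1.)  # NR 2007
print('S.min()     = %g' % S.min())
print('matlab tol  = %g (tol / S.min() = %.1f)' % (tol_matlab, tol_matlab / S.min()))
print('NR-2007 tol = %g (tol / S.min() = %.1f)' % (tol_nr2007, tol_nr2007 / S.min()))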

What do y'all think,

Matthew


