[Numpy-discussion] Matrix rank default tolerance - is it too low?

Matthew Brett matthew.brett at gmail.com
Tue Jun 26 22:29:01 EDT 2012


Hi,

On Tue, Jun 26, 2012 at 5:04 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
> On Tue, Jun 26, 2012 at 5:46 PM, Matthew Brett <matthew.brett at gmail.com>
> wrote:
>>
>> Hi,
>>
>> On Tue, Jun 26, 2012 at 4:39 PM, Benjamin Root <ben.root at ou.edu> wrote:
>> >
>> >
>> > On Tuesday, June 26, 2012, Charles R Harris wrote:
>> >>
>> >>
>> >>
>> >> On Tue, Jun 26, 2012 at 3:42 PM, Matthew Brett
>> >> <matthew.brett at gmail.com>
>> >> wrote:
>> >>
>> >> Hi,
>> >>
>> >> On Mon, Jun 18, 2012 at 3:50 PM, Matthew Brett
>> >> <matthew.brett at gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > On Sun, Jun 17, 2012 at 7:22 PM, Charles R Harris
>> >> > <charlesr.harris at gmail.com> wrote:
>> >> >>
>> >> >>
>> >> >> On Sat, Jun 16, 2012 at 2:33 PM, Matthew Brett
>> >> >> <matthew.brett at gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> Hi,
>> >> >>>
>> >> >>> On Sat, Jun 16, 2012 at 8:03 PM, Matthew Brett
>> >> >>> <matthew.brett at gmail.com>
>> >> >>> wrote:
>> >> >>> > Hi,
>> >> >>> >
>> >> >>> > On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <njs at pobox.com>
>> >> >>> > wrote:
>> >> >>> >> On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris
>> >> >>> >> <charlesr.harris at gmail.com> wrote:
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>> On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett
>> >> >>> >>> <matthew.brett at gmail.com>
>> >> >>> >>> wrote:
>> >> >>> >>>>
>> >> >>> >>>> Hi,
>> >> >>> >>>>
>> >> >>> >>>> I noticed that numpy.linalg.matrix_rank sometimes gives full
>> >> >>> >>>> rank for matrices that are numerically rank deficient.
>> >> >>> >>>>
>> >> >>> >>>> If I repeatedly make random matrices, then set the first column
>> >> >>> >>>> to be equal to the sum of the second and third columns:
>> >> >>> >>>>
>> >> >>> >>>> def make_deficient():
>> >> >>> >>>>     X = np.random.normal(size=(40, 10))
>> >> >>> >>>>     deficient_X = X.copy()
>> >> >>> >>>>     deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>> >> >>> >>>>     return deficient_X
>> >> >>> >>>>
>> >> >>> >>>> then the current numpy.linalg.matrix_rank algorithm returns full
>> >> >>> >>>> rank (10) in about 8 percent of cases (see appended script).
>> >> >>> >>>>
>> >> >>> >>>> I think this is a tolerance problem.  The ``matrix_rank``
>> >> >>> >>>> algorithm does this by default:
>> >> >>> >>>>
>> >> >>> >>>> S = spl.svd(M, compute_uv=False)
>> >> >>> >>>> tol = S.max() * np.finfo(S.dtype).eps
>> >> >>> >>>> return np.sum(S > tol)
>> >> >>> >>>>
>> >> >>> >>>> I guess we'd want the lowest tolerance that nearly always or
>> >> >>> >>>> always identifies numerically rank-deficient matrices.  One way
>> >> >>> >>>> of checking whether the tolerance is in the right range is to
>> >> >>> >>>> compare the calculated tolerance (``tol``) to the minimum
>> >> >>> >>>> singular value (``S.min()``), because S.min() in our case should
>> >> >>> >>>> be very small and indicate the rank deficiency.  The mean value
>> >> >>> >>>> of tol / S.min() for the current algorithm, across many
>> >> >>> >>>> iterations, is about 2.8.  We might hope this value would be
>> >> >>> >>>> higher than 1, but not much higher, otherwise we might be
>> >> >>> >>>> rejecting too many columns.
>> >> >>> >>>>
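[Editor's note: the behaviour described above can be reproduced with a short, self-contained sketch. The fixed seed and helper defaults here are illustrative additions, not taken from the appended script.]

```python
import numpy as np

def make_deficient(m=40, n=10, seed=0):
    # seed is an assumption added for reproducibility
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(m, n))
    # exact linear dependence: true rank is at most n - 1
    X[:, 0] = X[:, 1] + X[:, 2]
    return X

X = make_deficient()
S = np.linalg.svd(X, compute_uv=False)
tol = S.max() * np.finfo(S.dtype).eps   # current default tolerance
rank = np.sum(S > tol)
# S.min() is tiny because of the exact dependence, but it can still
# exceed this tight tol, in which case full rank (10) is reported.
ratio = tol / S.min()
```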
>> >> >>> >>>> Our current algorithm for tolerance is the same as the 2-norm
>> >> >>> >>>> of M * eps.  We're citing Golub and Van Loan for this, but now
>> >> >>> >>>> I look at our copy (p 261, last para) - they seem to be
>> >> >>> >>>> suggesting using u * |M|, where u = eps / 2 (p 61, section
>> >> >>> >>>> 2.4.2; see [1]).  I think the Golub
>> >>
>> >>
>> >> I'm fine with that, and agree that it is likely to lead to fewer folks
>> >> wondering why Matlab and numpy are different. A good explanation in the
>> >> function documentation would be useful.
>> >>
>> >> Chuck
>> >>
>> >
>> > One potential problem is that it implies that it will always be the same
>> > as
>> > any version of matlab's tolerance.  What if they change it in a future
>> > release? How likely are we to even notice?
>>
>> I guess that matlab is unlikely to change for the same reason that we
>> would be reluctant to change, once we've found an acceptable value.
>>
>> I was thinking that we would say something like:
>>
>> """
>> The default tolerance is:
>>
>>  tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>>
>> This corresponds to the tolerance suggested in NR page X, and to the
>> tolerance used by MATLAB at the time of writing (June 2012; see
>> http://www.mathworks.com/help/techdoc/ref/rank.html).
>> """
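[Editor's note: for illustration, a minimal sketch of a rank function using this MATLAB-style tolerance. The function name is hypothetical; this is not the numpy implementation.]

```python
import numpy as np

def matrix_rank_matlab_tol(M):
    # MATLAB-style default: scale eps by the larger matrix dimension
    S = np.linalg.svd(M, compute_uv=False)
    tol = S.max() * np.finfo(S.dtype).eps * max(M.shape)
    return int(np.sum(S > tol))

# For a 40x10 matrix this tolerance is 40 times larger than the
# current default (S.max() * eps), so exact rank deficiencies are
# far less likely to be missed.
```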
>>
>> I don't know whether we would want to track changes made by matlab -
>> maybe we could have that discussion if they do change?
>
>
> I wouldn't bother tracking Matlab, but I think the alternative threshold
> could be mentioned in the notes. Something like
>
> A less conservative threshold is ...
>
> Maybe mention that because of numerical uncertainty there will always be a
> chance that the computed rank could be wrong, but that with the conservative
> threshold the true rank is very unlikely to be less than the computed rank.
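[Editor's note: that trade-off can be checked empirically with a hedged Monte Carlo sketch. Counts are seed-dependent; the 8 percent figure earlier in the thread came from a different script.]

```python
import numpy as np

# Compare how often each tolerance misses an exact rank deficiency,
# i.e. reports full rank (10) for a matrix of true rank 9.
rng = np.random.RandomState(42)
trials = 200
misses_current = misses_matlab = 0
for _ in range(trials):
    X = rng.normal(size=(40, 10))
    X[:, 0] = X[:, 1] + X[:, 2]          # true rank is 9 by construction
    S = np.linalg.svd(X, compute_uv=False)
    eps = np.finfo(S.dtype).eps
    if np.sum(S > S.max() * eps) == 10:                  # current default
        misses_current += 1
    if np.sum(S > S.max() * eps * max(X.shape)) == 10:   # MATLAB-style
        misses_matlab += 1
```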

Sounds good to me.   Would anyone object to a pull request with these
changes (matlab tolerance default, description in docstring)?

Cheers,

Matthew


