[issue31889] difflib SequenceMatcher ratio() still have unpredictable behavior

Tim Peters report at bugs.python.org
Fri Nov 3 21:42:35 EDT 2017


Tim Peters <tim at python.org> added the comment:

Pass "autojunk=False" to your SequenceMatcher constructor and the ratio you get back will continue to increase as `i` increases.

The docs:

"""
Automatic junk heuristic: SequenceMatcher supports a heuristic that automatically treats certain sequence items as junk. The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching. This heuristic can be turned off by setting the autojunk argument to False when creating the SequenceMatcher.
"""

Note in particular the "at least 200 items long" there.  That's why you see a change in behavior at i == 200.  Before then, the autojunk heuristic is not triggered.

If we had it all to do over again, I'd make autojunk=False the default.  But - for backward compatibility - it's too late to change that now.

As to quick_ratio() and real_quick_ratio(), NOTHING about them is defined - or intended to be defined - beyond what the docs already say:

"""
quick_ratio()
Return an upper bound on ratio() relatively quickly.

real_quick_ratio()
Return an upper bound on ratio() very quickly.
"""

Nothing in the code you posted violates those promises (e.g., 0.99 is certainly an upper bound with respect to 0.0).  If they always returned 1.0, regardless of inputs, that would meet the docs' promises too.  Only the result of ratio() is defined.

As to why they exist, the docs explain that:

"""
ratio()
Return a measure of the sequences’ similarity as a float in the range [0, 1].

...

This is expensive to compute if get_matching_blocks() or get_opcodes() hasn’t already been called, in which case you may want to try quick_ratio() or real_quick_ratio() first to get an upper bound.
"""

So if the expense of calling ratio() isn't a concern in your context, there's no reason to call quick_ratio() or real_quick_ratio().  They exist so it's possible to code a cheap "early out" path in contexts where calling ratio() all the time may be prohibitively expensive.

For example, consider an app that has no use at all for a ratio < 0.5.  If quick_ratio() (or real_quick_ratio()) returns something less than 0.5, then it's guaranteed that ratio() would too, since the quicker versions return upper bounds on what ratio() would return.

This is valuable where it matters, and completely useless where it doesn't matter ;-)
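A sketch of that early-out pattern (the `close_matches` helper here is hypothetical, modeled on the filtering that difflib.get_close_matches does internally):

```python
from difflib import SequenceMatcher

def close_matches(query, candidates, cutoff=0.5):
    """Return (candidate, ratio) pairs with ratio() >= cutoff.

    The cheap upper bounds let us skip the expensive ratio()
    computation for candidates that can't possibly qualify.
    """
    results = []
    m = SequenceMatcher(autojunk=False)
    m.set_seq2(query)  # info about seq2 is cached across set_seq1() calls
    for cand in candidates:
        m.set_seq1(cand)
        # real_quick_ratio() and quick_ratio() are upper bounds on
        # ratio(); if either falls below the cutoff, ratio() must too,
        # so we can reject the candidate without computing ratio().
        if (m.real_quick_ratio() >= cutoff
                and m.quick_ratio() >= cutoff):
            r = m.ratio()
            if r >= cutoff:
                results.append((cand, r))
    return results
```

Calling set_seq2() once with the query and only varying seq1 matters here: SequenceMatcher caches detailed information about the second sequence, so reusing it keeps the per-candidate cost low.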

In any case, since `autojunk=False` repairs the behavior you pointed out, I'm closing this as Won't-Fix.

----------
resolution:  -> wont fix
stage:  -> resolved
status: open -> closed

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue31889>
_______________________________________