Percentage matching of text
Tim Peters
tim.peters at gmail.com
Fri Jul 30 16:07:18 EDT 2004
[Bruce Eckel]
> Ah, but here's an interesting one:
Not really <wink>. Do read the module docstrings and comments. It's
all explained there.
> >>> sm(None, 'abcd', 'abdc').ratio()
> 0.75
>
> >>> sm(None, 'abcd', 'abxx').ratio()
> 0.5
>
> So if it matches half the string it's 50% but if the last two
> characters are out of order that's an additional 25%.
I believe if you asked any human "cold" which of those two examples
uses strings that were "more similar", they'd agree that the first
pair is more similar.
> Other examples:
>
> >>> sm(None, 'abcd10', 'abdc20').quick_ratio()
> 0.83333333333333337
> >>> sm(None, 'abcd10', 'abdc20').ratio()
> 0.66666666666666663
> >>> sm(None, 'abcd10', 'abdc20').real_quick_ratio()
> 1.0
>
> You get a different interpretation for each "speed" of ratio.
As the docs say,
def quick_ratio(self):
"""Return an upper bound on ratio() relatively quickly.
This isn't defined beyond that it is an upper bound on .ratio(), and
is faster to compute.
"""
...
def real_quick_ratio(self):
"""Return an upper bound on ratio() very quickly.
This isn't defined beyond that it is an upper bound on .ratio(), and
is faster to compute than either .ratio() or .quick_ratio().
"""
Diffs of long sequences can be expensive to compute, and these guys
are speed hacks normally used in the stylized way
difflib.get_close_matches() uses them:
if s.real_quick_ratio() >= cutoff and \
s.quick_ratio() >= cutoff and \
s.ratio() >= cutoff:
That way the potentially expensive s.ratio() isn't called if the
cheap-ass upper bounds are already so poor that there's no chance of
meeting the cutoff.
> I started thinking that all I wanted was a pass-fail kind of thing so
> I wondered if real_quick_ratio() might do the trick.
No, nothing is defined about what real_quick_ratio() computes, beyond
that it's an upper bound on .ratio(). It would be fine if it always
returned 1.0, regardless of input. It wouldn't be *useful* then, but
it would be fine <wink>. What's wrong with using .ratio() directly?
The speed-hack upper-bound methods can pay hugely in speeding MxN
cross-product kinds of comparison tasks.
...
> It seems like there's no way to get real_quick_ratio() to say anything
> except "it's a perfect match!"
Not so, but you can read the code to learn how. What it does is about
the cheapest non-constant thing imaginable (it doesn't even look at
the elements of the sequences!), but it's still valuable for speeding
cross-product similarity applications.
More information about the Python-list
mailing list