Percentage matching of text

Bruce Eckel BruceEckel at MailBlocks.com
Fri Jul 30 12:50:57 EDT 2004


Thanks! The stringcmp.py module not only has nicely documented
function calls, but also gives examples of how to use difflib.
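
For readers following along, here is a minimal sketch of a percentage
match built on difflib from the standard library. The helper names
percent_match and assert_matches are hypothetical (they are not taken
from stringcmp.py), and the 83% default simply echoes the example figure
in the original question.

```python
import difflib


def percent_match(control, sample):
    """Return the similarity of two strings as a percentage (0 to 100).

    Hypothetical helper: wraps difflib.SequenceMatcher.ratio(), which
    returns 2*M/T, where M is the number of matching characters and T
    is the combined length of both strings.
    """
    matcher = difflib.SequenceMatcher(None, control, sample)
    return 100.0 * matcher.ratio()


def assert_matches(control, sample, threshold=83.0):
    """Fail if the sample matches the control by less than threshold percent."""
    score = percent_match(control, sample)
    assert score >= threshold, "only %.1f%% match" % score
```

A test would then call assert_matches(control_text, test_text) and fail
whenever the score drops below the chosen threshold.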

It has also occurred to me that in many of my examples the text will be
"position consistent" between the control sample and the test sample --
that is, the parts that match will often be in exactly the same place.
So I might be able to just march through position by position and do a
simple comparison. (But I had to ask the question on the newsgroup
before I could think of the answer myself. Funny how that works.)
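
A rough sketch of that position-by-position idea, under the assumption
that both samples are plain strings. The function name positional_match
is hypothetical, and dividing by the longer length (so that extra or
missing trailing text counts as a mismatch) is one possible choice, not
the only one.

```python
def positional_match(control, sample):
    """Percentage of positions at which both strings hold the same character.

    Hypothetical helper. Positions past the end of the shorter string
    count as mismatches because we divide by the longer length.
    """
    longest = max(len(control), len(sample))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    same = sum(1 for a, b in zip(control, sample) if a == b)
    return 100.0 * same / longest
```

Note that this only works when the text really is position consistent:
a single inserted character near the start shifts everything after it,
so positional_match("xabc", "abc") scores 0 even though the strings are
nearly the same.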

Anyway, I now seem to have some good footholds.

Friday, July 30, 2004, 9:20:06 AM, you wrote:

> On Fri, 2004-07-30 at 23:52, Bruce Eckel wrote:
>> What I'd like to do is find an algorithm that produces the results of
>> a text comparison as a percentage-match. Thus I would be able to
>> assert that my test samples must match the control sample by at least
>> (for example) 83% for the test to pass. Clearly, this wouldn't be a
>> perfect test but it would help flag problems, which is primarily what
>> I need.
>> 
>> Does anyone know of an algorithm or library that would do this? Thanks
>> in advance.

> Python implementations of a range of such algorithms can be found in
> Febrl - see section 9.2 of the manual:
> http://datamining.anu.edu.au/projects/linkage.html#prototype_software

> I suspect that a simple bigram comparison would meet your needs best. Or
> just use the Python difflib module in the standard Python library which
> implements the Ratcliff-Obershelp comparator.
> -- 

> Tim C

> PGP/GnuPG Key 1024D/EAF993D0 available from keyservers everywhere
> or at http://members.optushome.com.au/tchur/pubkey.asc
> Key fingerprint = 8C22 BF76 33BA B3B5 1D5B  EB37 7891 46A9 EAF9 93D0
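
The bigram comparison suggested in the quoted reply could be sketched
roughly as follows. This uses the Dice coefficient over character
bigrams, which is one common formulation and not necessarily the one
Febrl implements; the names bigrams and bigram_match are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Count every pair of adjacent characters in text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def bigram_match(control, sample):
    """Dice coefficient over character bigrams, as a percentage.

    Hypothetical helper: 2 * (shared bigrams) / (total bigrams in both).
    """
    a, b = bigrams(control), bigrams(sample)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        # Neither string is long enough to contain a bigram.
        return 100.0 if control == sample else 0.0
    shared = sum((a & b).values())  # multiset intersection
    return 100.0 * 2 * shared / total
```

Unlike a strict position-by-position check, this score does not collapse
when text is shifted by an insertion, since it ignores where in the
string each bigram occurs.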





Bruce Eckel    http://www.BruceEckel.com   mailto:BruceEckel at MailBlocks.com
Contains electronic books: "Thinking in Java 3e" & "Thinking in C++ 2e"
Web log: http://www.mindview.net/WebLog
Subscribe to my newsletter:
http://www.mindview.net/Newsletter
My schedule can be found at:
http://www.mindview.net/Calendar

"The whole problem with the world is that fools and fanatics are always
so certain of themselves, and wiser people so full of doubts."
  --Bertrand Russell




