find_longest_match in SequenceMatcher

Mon Jul 24 01:31:07 EDT 2006

koara wrote:
> Hello, it might be too late or too hot, but i cannot work out this
> behaviour of find_longest_match() in difflib.SequenceMatcher:
>
> string1:
[snipped 500-byte string]
>
> string2:
>
[snipped 500-byte string]
>
> find_longest_match(0,500,0,500)=(24,43,10)="version01t"
>
> What? O_o Clearly there is a longer match, right at the beginning!
> And then, after removal of the last character from each string (i found
> the limit of 500 by trial and error -- and it looks suspiciously
> rounded):

What limit? Three things suggest that there is no hidden, undocumented
length limit: (a) my results (see below), (b) my inspection of the
Python 2.4 source for the difflib module, and (c) what I know of the
author.

>
> find_longest_match(0,499,0,499)=(0,0,32)="releasenotesforwildmagicversion0"
>
>
> Is this the expected behaviour? What's going on?

My code (koara.py):
8<---
strg1 = \
r"""releasenotesforwildmagicversion01thiscdromcontainstheinitialreleaseofthesourcecodethataccompaniesthebook"3dgameenginedesign:apracticalapproachtorealtimecomputergraphics"thereareanumberofknownissuesaboutthecodeastheseissuesareaddressedtheupdatedcodewillbeavailableatthewebsitehttp://wwwmagicsoftwarecom/3dgameenginedesignhtmlbugssuggestionsforimprovementsandothercorrespondencecanbesenttosupport@magicsoftwarecomthecurrentknownissuesare1meshalgorithmforcontinuouslevelofdetailappearsnottobeworkingbase"""
strg2 = \
r"""releasenotesforwildmagicversion02updatefromversion01toversion02ifyourcopyofthebookhasversion01andifyoudownloadedversion02fromthewebsitethenapplythefollowingdirectionsforinstallingtheupdateforalinuxinstallationseethesectionattheendofthisdocumentupdatedirectionsassumingthatthetopleveldirectoryiscalledmagicreplacebyyourtoplevelnameyoushouldhavetheversion01contentsinthislocation1deletethecontentsofmagic\include2deletethesubdirectorymagic\source\mgcapplication3deletetheobsoletefiles:amagic\source\mgc"""
import sys
from difflib import SequenceMatcher as SM
print sys.version
smo = SM(None, strg1, strg2)  # None: no junk-filter function supplied
print len(strg1), len(strg2)
# One matcher over the full strings; only the search window
# (alo, ahi, blo, bhi) changes from call to call.
print smo.find_longest_match(0, 500, 0, 500)
print smo.find_longest_match(0, 499, 0, 499)
print smo.find_longest_match(0, 100, 0, 100)
print smo.find_longest_match(1, 101, 1, 101)
print smo.find_longest_match(2, 102, 2, 102)
8<---

The results on four Python versions:

C:\junk>c:\python24\python koara.py
2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)]
500 500
(24, 43, 10)
(24, 43, 10)
(24, 43, 10)
(24, 43, 10)
(24, 43, 10)

C:\junk>c:\python23\python koara.py
2.3.5 (#62, Feb  8 2005, 16:23:02) [MSC v.1200 32 bit (Intel)]
500 500
(24, 43, 10)
(24, 43, 10)
(24, 43, 10)
(24, 43, 10)
(24, 43, 10)

C:\junk>c:\python22\python koara.py
2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)]
500 500
(0, 0, 32)
(0, 0, 32)
(0, 0, 32)
(1, 1, 31)
(2, 2, 30)

C:\junk>c:\python21\python koara.py
2.1.3 (#35, Apr  8 2002, 17:47:50) [MSC 32 bit (Intel)]
500 500
(0, 0, 32)
(0, 0, 32)
(0, 0, 32)
(1, 1, 31)
(2, 2, 30)

It looks to me as though the problem has nothing at all to do with the
length of the searched strings: 2.1 and 2.2 find the 32-character match
for every window, while 2.3 and 2.4 never do, so a bug seems to have
appeared in 2.3. Which version(s) were you using? Can you reproduce
your results (500 and 499 giving different answers) with a single
version?
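
For an independent cross-check of what the longest match ought to be,
here is a minimal brute-force sketch. It is written in Python 3 syntax,
not the Python 2 used in koara.py. The helper name naive_longest_match
is my own invention and not part of difflib, and the two short strings
are only the opening characters of the strings above, so this
illustrates the check rather than reproducing the 500-character case.

```python
from difflib import SequenceMatcher


def naive_longest_match(a, b):
    """Brute-force longest common substring (hypothetical helper).

    Returns (i, j, size) such that a[i:i+size] == b[j:j+size], preferring
    the earliest start in a, then the earliest start in b.
    """
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)  # prev[j+1]: match length ending at a[i-1], b[j]
    for i, ca in enumerate(a):
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b):
            if ca == cb:
                cur[j + 1] = prev[j] + 1
                if cur[j + 1] > best[2]:
                    size = cur[j + 1]
                    best = (i - size + 1, j - size + 1, size)
        prev = cur
    return best


# Opening characters of the two posted strings (shortened for illustration).
a = "releasenotesforwildmagicversion01this"
b = "releasenotesforwildmagicversion02updatefromversion01to"

expected = naive_longest_match(a, b)
actual = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))

print(expected)       # (0, 0, 32)
print(tuple(actual))  # (0, 0, 32)
```

Running the same comparison against the full strings, on whichever
version is under suspicion, would show directly whether
find_longest_match() disagrees with the brute-force answer.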

Anyway, as they say in the classics, "Take a number; the timbot will be
with you shortly."

Cheers,
John