locating strings approximately

Thu Jun 29 19:37:14 EDT 2006

On 29/06/2006 10:52 AM, John Machin wrote:
> On 29/06/2006 10:07 AM, BBands wrote:
>> On 6/28/06, John Machin <sjmachin at lexicon.net> wrote:
>>> On 29/06/2006 9:28 AM, BBands wrote:
>>> > I'd like to see if a string exists, even approximately, in another. 
>>> For
>>> > example if "black" exists in "blakbird" or if "beatles" exists in
>>> > "beatlemania". The application is to look though a long list of songs
>>> > and return any approximate matches along with a confidence factor. I
>>> > have looked at edit distance, but that isn't a good choice for finding
>>> > a short string in a longer one.
>>>
>>> There is a trivial difference between the traditional
>>> distance-matrix-based Levenshtein algorithm for edit distance and the
>>> corresponding one for approximate string searching. Ditto between
>>> finite-state-machine approaches. Ditto between modern bit-bashing
>>> approaches.

The trivial difference is that in the searching case one edge of the 
dynamic programming matrix is initialised (in theory) to [0] * 
(len(text)+1) whereas in the distance case it is set to range(len(text)+1).

>>>
>>> > I have also explored
>>> > difflib.SequenceMatcher and .get_close_matches, but what I'd really
>>> > like is something like:
>>> >
>>> > a = FindApprox("beatles", "beatlemania")
>>> > print a
>>> > 0.857
>>> >
>>> > Any ideas?
>>>
>>> You got no ideas from googling "approximate string search python"???
>>
>> Yes, many including agrepy and soundex in addition to those I
>> mentioned already, but none seem really handy at approximately looking
>> up smaller strings in larger ones. I also note that this has been the
>> topic of prior discussion without resolutiuon.
>>
>>    jab
> 
> It helps if you tell all that you've done. Otherwise people will tell 
> you to do what you've done already :-)
> 
> It helps if you reply on-list so that others can see. You may get better 
> answers sooner. I have to vanish now. Will check back tonight.
> 

Here's a possibly better answer:

def approx_search(pattern, text, max_dist):
     """
     Return a generator which yields tuples (endpos, dist)
     for each possible endpos such that
     (1) Levenshtein_distance(pattern, text[startpos:endpos]) = dist
     (2) 0 <= dist <= max_dist
     for some (undetermined) startpos.
     Not restricted to strings:
     (a) pattern must support pattern[i] and len(pattern).
     (b) text needs only to support enumerate(text).
     (c) contents of pattern and text need only support the != operator.
     Example:
     list(approx_search('Beatles', 'beetles Beat leverage Betelguese', 
3)) ->
     [(6, 3), (7, 2), (8, 3),
     (12, 3), (13, 3), (14, 3), (15, 2), (16, 2), (17, 3),
     (26, 3), (27, 3)]
     """
     plen = len(pattern)
     prange = range(plen)
     dprev = range(plen+1)
     dcurr = [0] * (plen+1)
     dist = plen
     for tx, tc in enumerate(text):
         # dcurr[0] = 0 # not needed, never changes
         for px in prange:
             dist = dprev[px] + (tc != pattern[px])
             temp = dcurr[px] + 1
             if temp < dist: dist = temp
             temp = dprev[px+1] + 1
             if temp < dist: dist = temp
             dcurr[px+1] = dist
         if dist <= max_dist:
             yield tx+1, dist
         dprev, dcurr = dcurr, dprev

If you need/want to poke around discovering "startpos", then you need 
something like the following, which is similar to the function by Magnus 
Lie Hetland which you'll find on the web, but appears to be about twice 
as fast without destroying legibility completely :-)

def Levenshtein_SJM(aseq, bseq):
     """
     Calculate Levenshtein distance between aseq and bseq.
     Space O(min(m,n))
     Time O(mn)
     """
     alen = len(aseq)
     blen = len(bseq)
     if alen > blen:
         aseq, bseq = bseq, aseq
         alen, blen = blen, alen
     if not alen: return blen
     arange = range(alen)
     dprev = range(alen+1)
     dcurr = [0] * (alen+1)
     for bx in xrange(blen):
         bc = bseq[bx]
         dcurr[0] = bx + 1
         for ax in arange:
             dist = dprev[ax] + (bc != aseq[ax])
             temp = dcurr[ax] + 1
             if temp < dist: dist = temp
             temp = dprev[ax+1] + 1
             if temp < dist: dist = temp
             dcurr[ax+1] = dist
         dprev, dcurr = dcurr, dprev
     return dist

Note that both of the above functions use the ancient original 
algorithm, which takes O(mn) time. For heavy lifting a modern algorithm 
and use of Pyrex or C are indicated.

HTH,
John