newb: comapring two strings

Paul McGuire ptmcg at austin.rr._bogus_.com
Fri May 19 13:31:58 EDT 2006


"John Machin" <sjmachin at lexicon.net> wrote in message
news:1148038726.023455.145900 at j73g2000cwa.googlegroups.com...

> In that case, the problem can be solved in O(n) time by a simple loop
> which counts the number of differences and notes the position of the
> first (if any) difference. Any full-distance Levenshtein method that
> does no pre-processing must take O(n**2) time. It is possible to take
> advantage of the fact that the OP doesn't care about what the distance
> is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc.
>

Here is a class that implements 4 different approaches to this comparison
problem, with a configurable number of maximum mismatches (where a and b are
the strings being compared):

1. use sum(map(lambda (x,y): x!=y, itertools.izip(a,b))) to count
mismatches - no short-circuiting
2. use sum(map(abs, itertools.starmap(cmp, itertools.izip(a,b)))) to count
mismatches - no short-circuiting
3. use explicit for loop to compare each tuple returned from
itertools.izip(a,b) - short-circuits when number of mismatches exceeds
allowable number
4. use for loop over itertools.takewhile to count mismatches -
short-circuits when number of mismatches exceeds allowable number

Also note the short-circuiting in the initializer - if the strings being
compared are shorter in length then the number of allowed mismatches, than
they will always match, no comparisons required.  (In the OP's specific
case, any two one-character strings will match within one character).

Of course this does nothing to address the larger issue of the program
design - I just wanted to learn a little more about itertools. :)

-- Paul


from itertools import izip,starmap,takewhile

class offByNoMoreThanOneCharacter(object):
    def __init__(self,a,b,maxMismatches=1):
        len_a = len(a)
        self.val = None
        if len_a != len(b):
            self.val = False
        elif len_a <= maxMismatches:
            self.val = True
        else:
            self.a = a
            self.b = b
            self.maxMismatches = maxMismatches

    def eval1(self):
        # one-liner using map with lambda - sum counts mismatches
        self.val = sum(map(lambda (x,y): x!=y, izip(self.a,self.b))) <=
self.maxMismatches

    def eval2(self):
        # one-liner using map with abs and cmp - sum counts mismatches
        self.val = sum(map(abs, starmap(cmp, izip(self.a,self.b)))) <=
self.maxMismatches

    def eval3(self):
        # explicit traversal, with short-circuit when too many mismatches
found
        mismatches = 0
        for (ac,bc) in izip(self.a,self.b):
            if ac != bc:
                mismatches += 1
                if mismatches > self.maxMismatches:
                    self.val = False
                    break
        else:
            self.val = True

    def eval4(self):
        # traversal using takewhile, also short-circuits when too many
mismatches found
        numMismatches = 0
        maxMismatches = self.maxMismatches
        stillOk = lambda (s,t)=(None,None) : numMismatches <= maxMismatches
        for (ac,bc) in takewhile(stillOk, izip(self.a,self.b)):
            numMismatches += (ac != bc)
        self.val = stillOk()


    # special instance method to return "boolean-ness" of this object
    def __nonzero__(self):
        if self.val is None:
            # change this line to use whichever eval method you want to test
            self.eval1()
        return self.val


def test(a,b):
    print a,b
    print bool(offByNoMoreThanOneCharacter(a,b))
    print

test("abc","abd")
test("abc","axd")
test("yaqtil","yaqtel")
test("yiqtol","yaqtel")





More information about the Python-list mailing list