spell checking

dataangel k04jg02 at kzoo.edu
Sun Oct 31 03:33:00 EST 2004


I was just curious whether there are any Python spell-checker modules
around that can guess what the user meant to type. I wrote a quick
function that splits two strings into bigrams and then checks how many
bigrams at the same position are identical, which I think is roughly
how Google does it. Support for trigrams etc. could be added, so I'm
curious whether anyone out there has done something more. Here's the script:

def StringsSimilarity(str1, str2):
    """Divides the two strings into non-overlapping bigrams and
    returns the fraction (0 to 1.0) that are equal at the same position"""
    str1 = str1.strip().lower()
    str2 = str2.strip().lower()
    bigramStr1 = []
    bigramStr2 = []
    currentList = bigramStr1
    i = 0
    j = 0
   
    # If either string is empty (or both are), they are never similar
    if not (str1 and str2):
        return 0.0
   
    # 100% match if equal
    if str1 == str2:
        return 1.0

    # Make strings equal length, simplifies things
    len1 = len(str1)
    len2 = len(str2)
   
    if len1 > len2:
        str2 = str2 + " "*(len1-len2)
    elif len2 > len1:
        str1 = str1 + " "*(len2-len1)
       
    len1 = len(str1)
    len2 = len(str2)
   
    currentString = str1
   
    # Generate bigrams
    while j < 2:   
        i = 0
        while i < len1:
            if i+1 >= len1:
                currentList.append(currentString[i])
            else:
                currentList.append(currentString[i] + currentString[i+1])
               
            i += 2
       
        j += 1
        currentList = bigramStr2
        currentString = str2
   
    similarity = 0
   
    for i in range(len(bigramStr1)):
        if bigramStr1[i] == bigramStr2[i]:
            similarity += 1.0
           
    if similarity == 0:
        return 0
   
    return similarity/len(bigramStr1)
       
def StringsSimilar(str1, str2):
    """Using StringsSimilarity, decides if the two
    strings' score is good enough, 50%, to be
    considered similar"""
    return StringsSimilarity(str1, str2) >= 0.50
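
For the trigram support mentioned above, here is a minimal sketch of one way it might look. It is not the same method as the script: it assumes overlapping n-grams and scores them with a Dice coefficient (twice the shared n-grams over the total), so an inserted or dropped letter does not shift every later pair out of alignment. The names ngrams, ngram_similarity and suggest are made up for illustration, and the 0.5 cutoff is simply carried over from StringsSimilar.

```python
def ngrams(text, n=2):
    """Return the overlapping n-grams of text, e.g. 'word' -> ['wo', 'or', 'rd']."""
    text = text.strip().lower()
    if len(text) < n:
        # Too short to form a full n-gram: use the whole string, if any
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_similarity(str1, str2, n=2):
    """Dice coefficient over n-grams: 2 * shared / total, between 0.0 and 1.0."""
    grams1 = ngrams(str1, n)
    grams2 = ngrams(str2, n)
    if not grams1 or not grams2:
        return 0.0
    # Count shared n-grams, using each n-gram of str2 at most once
    remaining = list(grams2)
    shared = 0
    for gram in grams1:
        if gram in remaining:
            remaining.remove(gram)
            shared += 1
    return 2.0 * shared / (len(grams1) + len(grams2))


def suggest(word, vocabulary, n=2, cutoff=0.5):
    """Return the best-scoring vocabulary entry, or None if none reaches cutoff."""
    best, best_score = None, cutoff
    for candidate in vocabulary:
        score = ngram_similarity(word, candidate, n)
        if score >= best_score:
            best, best_score = candidate, score
    return best
```

Calling suggest("pythn", ["python", "java", "perl"]) picks "python", and passing n=3 switches the comparison to trigrams. The standard library's difflib.get_close_matches does a similar "closest word" lookup with a different matching algorithm, so it may be worth trying before rolling your own.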



