Case-insensitive string comparison

Mike C. Fletcher mcfletch at rogers.com
Tue Apr 15 10:46:25 EDT 2003


Avner Ben wrote:
...

>    I am searching for one or more members of a list of names (that may get
>near 1000 pieces in a normal application and may average some 40 characters)
>in a string (that may average 100 characters). For each name found in the
>string, I do something to the string, using the original name (in its proper
>case). At present, I am looking for the exact name as stated, with agreeable
>performance. However, this makes me miss many cases where the author of the
>string did not care for the casing (which happens frequently). I have little
>problem with either lowering or uppering the entire list once, because it is
>not expected to change often and when that happens once in a while, a
>subscription mechanism can take care of that. However, this requires holding
>two lists (I still need the original names as are) and lowering or uppering
>each string for comparison (and keep its original version for
>restructuring).
>
>    Obviously, the efficient algorithm here would be to skip the standard
>comparisons altogether and write an algorithm that advances on the string
>character by character and - for each - attempts to parse using the list.
>
...

Sure, but as you discovered, it's not the best way in Python.  I'd guess 
that if you analysed it, it's a pretty bad algorithm for C too.  Unless 
your comparison function is such that it needs to adapt to *both* words 
in a pair, it's far more time-efficient to do the translation of each 
lexicon word once, cache the results (trading space for time), and reuse 
them.  Upper/lower casing is a trivial enough operation that the savings 
might be swamped by the setup/teardown cost, but the more involved your 
fuzziness gets, the more the character-by-character approach becomes a 
liability without some pre-computation.

lexicon = {} # translated: [matching, words]

def translate( word ):
    """Translate a word to its lookup key (simple case-folding here)"""
    return word.lower()

def lookup( word ):
    """Return the lexicon words which match word"""
    key = translate( word )
    return lexicon.get( key, [] )

def addWord( word ):
    """Add a new word to the lexicon"""
    key = translate( word )
    lexicon.setdefault( key, [] ).append( word )
    return key
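In concrete terms, the table behaves like this -- a minimal self-contained
sketch, with translate() assumed to be plain case-folding (swap in whatever
normalisation you actually want):

```python
lexicon = {}  # translated key -> list of original words

def translate(word):
    # Assumed normalisation: simple case-folding.
    return word.lower()

def addWord(word):
    key = translate(word)
    lexicon.setdefault(key, []).append(word)
    return key

def lookup(word):
    # All originals sharing the same translated key, preserving their case.
    return lexicon.get(translate(word), [])

for name in ("Avner", "AVNER", "Mike"):
    addWord(name)

print(lookup("avner"))   # -> ['Avner', 'AVNER']
print(lookup("MIKE"))    # -> ['Mike']
```

Lookup is then a single dict probe per word in the string, and the
originals come back with their proper case intact.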

That approach has the neat benefit of allowing phonetic compression (or 
the like) at some later stage, which lets you get even more "fuzzy" in 
your matching.  Using the bisect module with (translated, original) pairs 
is a useful modification if you eventually need ordering of the 
word-list, or scanning of an entire range of words (which is useful in 
larger-distance fuzzy searching).
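For instance, the bisect variant might look like this -- a sketch, not a
drop-in, with the translation again assumed to be case-folding and the
names list invented for illustration:

```python
import bisect

# Sorted list of (translated, original) pairs; keeping them sorted is
# what lets bisect find the start of a range in O(log n).
pairs = sorted(
    (name.lower(), name)
    for name in ["Avner", "avner ben", "AVNERY", "Mike", "Guido"]
)

def prefix_range(prefix):
    """Return the original names whose translated form starts with prefix."""
    key = prefix.lower()
    # A 1-tuple sorts before any 2-tuple with the same first element,
    # so this lands at the first candidate match.
    lo = bisect.bisect_left(pairs, (key,))
    results = []
    for translated, original in pairs[lo:]:
        if not translated.startswith(key):
            break
        results.append(original)
    return results
```

Here prefix_range("avner") walks the contiguous run of matching keys,
which is exactly the "range of words" scan mentioned above.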

Anyway, enjoy yourself,
Mike

_______________________________________
  Mike C. Fletcher
  Designer, VR Plumber, Coder
  http://members.rogers.com/mcfletch/
