Help Optimizing Word Search

Tue Jan 11 20:12:27 EST 2005

Case  Nelson wrote:
> Hi there I've just been playing around with some python code and I've
> got a fun little optimization problem I could use some help with.
>
> Basically, the program needs to take in a random list of no more than
> 10 letters,  and find all possible mutations that match a word in my
> dictionary (80k words). However a wildcard letter '?' is also an
> acceptable character which increases the worst case time
significantly.
> So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc,
ca,
> cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
> added to the dict of words. If the letters are ['?','?'] check a-z,
aa,
> ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz

This appears to be a Computer Science 101 Data Structures and
Algorithms question, not a Python question, but here's an answer
anyway:

You appear to want to find all words that have one or more letters in
common with your query or candidate string.

Aside: Have you been following the long thread started by the poster
who appeared to want to store all possible strings that were _not_
words in a given language but could be generated from its alphabet?

Here's a possibility: use a bit mask approach. You attach a bit mask to
each word; simple data structure -- a list of 2-tuples, or two parallel
lists.

!def mask(word):
!   m = 0
!   for letter in word:
!       m |= 1 << (ord(letter) - ord('a'))
!   return m

Searching without wildcards:

!def nowc_search(candidate, mylistof2tuples):
!    candmask = mask(candidate) # treating candidate as str, not list
!    for word, mask in mylistof2tuples:
!        if mask & candmask:
!           # one or more letters in common
!           yield word

Note: this treats "mississippi" and "misp" the same. If "aa" is in your
dictionary, what queries would retrieve it? Depending on your exact
requirements, this technique may suit you, or you may want to use it as
a fast(?) filter, with the matches it throws up needing further
checking. You may need a "count number of bits that are set in an int"
function.

Ref: Fred J. Damerau, "A technique for computer detection and
correction of spelling errors", CACM vol 7 number 3, March 1961.

Searching with wild cards: your example of query == "??" seems to yield
all two-letter words. I'd like to see what you expect for "a?", "?a",
"ab?", and "aa?" before suggesting how to tackle wild cards.
Reverse-engineering requirements out of other folks' code is not
something I do for fun :-)

An alternative for you to think about: instead of a bitmask, store the
letter-sorted transformation of the words: cat->act, act->act,
dog->dgo, god->dgo.

Alternative data structure: key = bitmask or sorted-letters, value =
list of all words that have that key.

A further suggestion which should always be considered when setting up
a search where the timing is worse than average O(1): have a separate
dictionary for each different wordlength, or some other
impossible-length-avoidance filter; that way, with minimum preliminary
calculation you can avoid considering words that are so long or so
short that they cannot possibly be matches. For example, with
approximate matching based on edit distance, if you are searching for a
10-letter word allowing for 2 errors, you can avoid doing the
complicated comparison on words shorter than 8 or longer than 12.
HTH,
John