Help Optimizing Word Search

Wed Jan 12 00:02:48 EST 2005

First of all thank you very much for the reply. I hope I'm not too
verbose here, just hope that if someone else runs into a similar
problem they can find an answer here.

> This appears to be a Computer Science 101 Data Structures and
> Algorithms question, not a Python question, but here's an answer
> anyway

Heh, it's been awhile since I took data structures 101 but I asked here
after reading about python generators, and although I can't quite get
my head around them, I have a sense that they could help me check all
mutations (is this the right word?) of the candidate strings.

Thanks for the ideas on finding matches, I will have to try the bit
filter out to see if its faster. The trie works similar to this in that
it will return a 3  if the substring is not in the dictionary so that
you don't have to check any superset of the mutation. Although to
clarify, the reason I posted was mostly for help on my weak wildcard
algorithm.

> If "aa" is in your dictionary, what queries would retrieve it?

Any query that contains either "a" and "a" or "a" and "?" or "?" and
"?". So some examples would be:
"aa", "a?", "aba", "baa", "bbbaba", "b?b?", "bab?", "?abbbbb" etc...

> Searching with wild cards: your example of query == "??" seems to
yield
> all two-letter words. I'd like to see what you expect for "a?", "?a",
> "ab?", and "aa?" before suggesting how to tackle wild cards.
> Reverse-engineering requirements out of other folks' code is not
> something I do for fun :-)

My apologies, in my haste my example of what query == "??" yields was
unclear. Every list of candidate letters should yield every unique (if
possible) mutation, of length 1 to len(candidate).

Heh, exactly what each of those strings would yield is pretty much the
root of my problem:

?? yields (the order of the results dont matter):
a, b, c, d, e, f, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, bb,
bc, ..., ya, yb, yc, ..., yx, yy, yz, za, zb, zc, ..., zx, zy, zz

a? yields
a, b, c, d, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, ca, da, ...,
xa, ya, za

?a yields same as a?

ab? yields
a, b, c, d, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, bb, bc, ...,
bx, by, bz, ca, cb, da, db, ea, eb, ..., xa, xb, ya, yb, za, zb, aba,
abb, abc, abd, ..., abx, aby, abz, baa, bab, bac, ..., bax, bay, baz,
aab, acb, adb, ..., axb, ayb, azb, bba, bca, bda, ... bxa, bya, bza,
cba, dba, eba, xba, yba, zba

HTH, basically every possible string out of the given pool of up to 10
letters.

Ok writing these out just gave me an idea on how I could try and write
this another way.

Ok done. Wow, this program actually ran much much slower than my
original 135 seconds vs 9 seconds!! I'm not sure if I made a huge
mistake or if it's because in the original I can stop searching
superset-candidates if the candidate is not a substring of a dictionary
word (in the same way the mask would reject impossible mutations).

This produces the correct mutations however (although not unique as I
wrote out above):

-- begin code --
!def getmutations(pool):
!
!    for i, letter in enumerate(pool):
!        pool2 = pool[:]
!        pool2.pop(i)
!
!        for mutation in getmutations(pool2):
!            if letter != '?':
!                yield letter + mutation
!            else:
!                for c in alpha:
!                    yield c + mutation
!
!    yield ""
!letters = list('abae?s?')
!s = time.time()
!output = open('words.log','w')
!for candidate in getmutations(letters):
!    # Check if word is valid
!    result = mytrie.query(candidate)
!
!    # Found a word, add it to the database
!    if result == 1:
!        words[candidate] = 1
!        
!print time.time() - s
--- end code ---