searching algorithm

Thu May 17 11:50:52 EDT 2007

On 2007-05-11, Terry Reedy <tjreedy at udel.edu> wrote:
>
> "Neil Cerutti" <horpner at yahoo.com> wrote in message 
> news:lNM0i.34412$G23.27437 at newsreading01.news.tds.net...
>| Every node is a tuple of its letter, a list of its children, and
>| a list of its words. So the two 'pelin' nodes would be (with 'e'
>| referenced in the 'h' node):
>|
>| ('h', [('e', [], ['pelin'])], ['pelin'])
>|
>| That would in turn be "stored" in the t, n, i and s nodes.
> [snip]
>
> At the outer level, I would use a list in order to build the
> structure in pieces, one for each letter, and then add them in.
>
> At the top level, the letters do not need explicit storage.
> The first subtree is for words starting with 'a', etc.  In
> other words, the position of each subtree indicates its
> starting letter.  For most letters, this can be carried on
> another level.

Here's a proof of concept prototype of the dictionary searching
algorithm. The dictionary I found to test with has roughly 5,000
words, and I didn't attempt any optimizations at all.  I'm
guessing it can't be sped up very much, even though I wrote the
simplest possible implementation. Lookup in the worst case is
O(N), where N is the number of letters in your prefix.

First, a sample of input/output.

lookup: pe
people: la gente[Noun]
pepper: pepe[Noun]
peppercorn: grano di pepe[Noun]
peppermint: menta peperita[Noun]
peppery: pepato, acre, pungente[Adjective]
pepsin: pepsina[Noun]
permutation: permutazione[Noun]
permutations: permutazioni[Noun]
permute: permutare[Verb]
perorate: perorare
perpendicular: perpendicolare[Adjective]
perpendicularly: perpendicolarmente[Adverb]
perpendiculars: perpendicolari[Noun]
perpetrate: perpetrare[Verb]
perpetrated: perpetrato[Verb]
perpetual: perpetuo[Adjective]
petard: petardo[Noun]
petroleum: petrolio[Noun]

Now the source.

# Each node is a tuple of (letter, node list, word list). A word
# list is just a list of strings.
root = ('', [], [])

def insert(node, key, value):
    if len(key) == 0:
        node[2].append(value)
    else:
        first = key[0]
        rest = key[1:]
        for n in node[1]:
            if n[0] == first:
                insert(n, rest, value)
                return
        node[1].append((first, [], []))
        insert(node, key, value)

def lookup(node, key):
    if len(key) == 0:
        return node
    else:
        first = key[0]
        rest = key[1:]
        for v in node[1]:
            if v[0] == first:
                return lookup(v, rest)
        return None

def word_list(node, word):
    for v in node[2]:
        print '%s: %s' % (word, v)
    for n in node[1]:
        word_list(n, word+n[0])

# Italian.txt from http://www.june29.com/IDP/
source = open('Italian.txt', 'r')

# Build tree
for line in source:
    line = line.strip()
    if line[0] == '#': continue
    key, value = line.split('\t')
    insert(root, key, value)

# Look up a prefix
x = raw_input("lookup: ")

tree = lookup(root, x)
if tree:
    word_list(tree, x)
else:
    print "Not found."

-- 
Neil Cerutti