searching algorithm

ciju mail.ciju.cherian at gmail.com
Fri May 11 01:29:08 EDT 2007


On May 11, 3:12 am, Neil Cerutti <horp... at yahoo.com> wrote:
> On 2007-05-10, Gordon Airporte <JHoo... at fbi.gov> wrote:
>
>
>
>
>
> >> For the above (abrideged) dictionary, you would generate (use a
> >> fixed-width "programmers" font so the tree looks good):
>
> >>                  a
> >>                  |
> >>                  b
> >>                  |
> >>                  s
> >>                 / \
> >>                i   o
> >>               /   / \
> >>              n   l   r
> >>             /   / \   \
> >>            t   u   v   b->(absorbirati, crpisti)
> >>           /    |   |
> >> (pelin)<-h     t   e->(odrije?iti, osloboditi)
> >>          |     |
> >> (pelin)<-e     e->(apsolutan, apsolutni kod)
>
> >> As the user enter letters, you just march down the tree, printing
> >> all the words held in leaf nodes held in the current node.
>
> > Call me dense, but how does one do this in Python - which
> > doesn't have pointers? Dictionaries with dictionaries within
> > dictionaries... (with each letter as the key and the its
> > children as values) is going to be extremely space inefficient,
> > right?
>
> Unfortunately, I don't know the space tradeoffs in Python
> offhand. Lists and tuples make excellent trees.
>
> The above might be stored as follows:
>
> Every node is a tuple of its letter, a list of its children, and
> a list of its words. So the two 'pelin' nodes would be (with 'e'
> referenced in the 'h' node):
>
> ('h', [('e', [], ['pelin'])], ['pelin'])
>
> That would in turn be "stored" in the t, n, i and s nodes.
>
> ('s',
>  [('i',
>   [('n',
>    [('t',
>     [('h', [('e', [], ['pelin'])], ['pelin'])
>      [])]
>     [])]
>    []), ('o' trie (thanks Terry) omitted for my sanity)])
>
> It's a lot harder to write by hand than it would be to use.
>
> My intuition says it shouldn't be terribly hard on resources for
> for a 180K dictionary, but I could be wrong. I'm too lazy to
> measure. ;)
>
> If it does turn out to be unreasonably huge, then you'd fall back
> on solving the problem completely with a binary search returning
> a range (I'm not sure of the name), which would be more expensive
> at run time, but might be fast enough, and would use a minimal
> amount of 'resources'.
>
> --
> Neil Cerutti

The problem that I see here is that each time user types a letter, all
the leaf nodes in the subtree would have to be traversed again. This
computation was already done for all the letters user typed before the
last one.

My suggestion would be to have two data structures. The first one
similar to the tree mentioned above, but leaf nodes need not have
Croatian translations, and each node should also have total number of
leafs in both left siblings and right siblings of that node. The
second data structure would be list of English:Croatian word pairs.

As the user types a new letter, we just have to remove(not show), the
number of leaf elements in the left and/or right siblings, from left
and/or right of the second data structure. So we just have to keep
track of the node we r currently at(in the tree) and the start, end
index for the second data structure(basically the list to be shown to
the user).

By the way, both data structures could be implemented as tuple in
python, for I suppose, if only lookup is needed tuple gives better
performance than list.

ciju




More information about the Python-list mailing list