Efficient String Lookup?

Josiah Carlson jcarlson at uci.edu
Sun Oct 17 04:05:57 EDT 2004


> Sorry for the ambiguity. My case is actually pretty simple. '#' 
> represents any single character, so it's essentially the same as re's 
> '.'. The match must be exact, so the string and pattern must be of equal 
> lengths. Each wildcard is independent of other wildcards. For example, 
> suppose we restricted the possible characters to 1 and 0, then the 
> pattern '##' would only match '00', '01', '10', and '11'. This pattern 
> would not match '0', '111', etc. I feel that a trie would work well, but 
> I'm concerned that for large patterns, the overhead in the Python 
> implementation would be too inefficient.

Having implemented what is known as a burst trie (a trie where you don't
expand a branch until it has more than 'k' entries) in Python, it ends
up taking up much more space, but that isn't really an issue unless you
have large numbers of strings (millions), or the strings are long
(kilobytes).

If you want to make it more efficient (space-wise), write the algorithm
and structures in pure C, then wrap it with SWIG.  Add options for
inserting and deleting strings, and also querying for strings that match
a certain pattern.

Thinking about it, if your dictionary is very restricted, you could just
toss all strings in a balanced search tree, doing a similar tree
traversal as the trie solution.  Much less overhead, most of the
benefits.

 - Josiah




More information about the Python-list mailing list