Efficient String Lookup?

Win Carus wcarus at comcast.net
Sun Oct 17 23:01:58 EDT 2004


"Chris S." <chrisks at NOSPAM.udel.edu> wrote in message news:<dB5cd.387$B34.355 at trndny02>...
> I have a number of strings, containing wildcards (e.g. 'abc#e#' where # 
> is anything), which I want to match with a test string (e.g. 'abcdef'). 
> What would be the best way for me to store my strings so lookup is as 
> fast as possible?

Regular expressions are probably not the simplest or most
straightforward option here.

A simpler and more straightforward approach is to (a) pre-compute
all the possible strings (i.e., interpolate all possible wildcard
values) along with their associated values, and (b) store them in a
dictionary. Pre-computation is easy in your case because you only have
single-character wildcards. This has four benefits: it uses a standard
and highly optimized native Python data structure (the dictionary);
it's very fast; it's very simple to update or modify keys and values
("cargo"); and it identifies pattern collisions (i.e., patterns that
have two or more actions associated with them). A sketch follows.
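
Here's a minimal sketch of that approach, assuming a '#' wildcard and
a lowercase-ASCII alphabet (adjust both to your data); the
expand_pattern and build_table names and the 'action-1' cargo are made
up for illustration:

import itertools
import string

def expand_pattern(pattern, wildcard='#', alphabet=string.ascii_lowercase):
    # Yield every concrete string the wildcard pattern can match.
    # Single-character wildcards only, over a known, finite alphabet.
    slots = [alphabet if ch == wildcard else ch for ch in pattern]
    for combo in itertools.product(*slots):
        yield ''.join(combo)

def build_table(patterns):
    # Map every expansion to its cargo; flag pattern collisions.
    table = {}
    for pattern, cargo in patterns.items():
        for s in expand_pattern(pattern):
            if s in table and table[s] != cargo:
                raise ValueError('pattern collision on %r' % s)
            table[s] = cargo
    return table

table = build_table({'abc#e#': 'action-1'})
print(table['abcdef'])    # -> action-1

For 'abc#e#' over a 26-letter alphabet this expands to 26 * 26 = 676
keys, and each lookup is then a single dictionary access; build_table
also raises an error when two patterns expand to the same string with
different cargo, which covers the collision-detection point above.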

On the other hand, if the number of pre-computed strings is "too
large" and there is a lot of prefix and/or suffix similarity among the
patterns, you could consider an automaton, a trie, or a DAG (directed
acyclic graph). Any string-matching automaton or trie gives you prefix
compression automatically and lets you attach arbitrary data ("cargo")
to each matching state; a toy sketch follows below. If there is
considerable suffix similarity, a DAG is also worth considering, since
it ties together similar suffixes that carry identical cargo.
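
For instance, here's a toy trie sketch (the TrieNode class and the
insert/lookup functions are my own illustration, not any standard
library API); it shares prefixes between patterns and lets a '#' node
match any single character at lookup time:

class TrieNode(object):
    def __init__(self):
        self.children = {}   # char (or '#') -> TrieNode
        self.cargo = None    # payload attached to an accepting state

def insert(root, pattern, cargo):
    # Add one wildcard pattern; shared prefixes reuse existing nodes.
    node = root
    for ch in pattern:
        node = node.children.setdefault(ch, TrieNode())
    node.cargo = cargo

def lookup(node, text, i=0):
    # Walk the trie; a '#' child matches any input character.
    if i == len(text):
        return node.cargo
    exact = node.children.get(text[i])
    if exact is not None:
        hit = lookup(exact, text, i + 1)
        if hit is not None:
            return hit
    wild = node.children.get('#')
    if wild is not None:
        return lookup(wild, text, i + 1)
    return None

root = TrieNode()
insert(root, 'abc#e#', 'action-1')
insert(root, 'abd#e#', 'action-2')   # shares the 'ab' prefix
print(lookup(root, 'abcdef'))        # -> action-1

Note that lookup tries the exact-character child first and falls back
to the wildcard branch, so it backtracks correctly when an exact
prefix dead-ends. A DAG (often called a DAWG in this context) would
additionally merge the identical '#'-suffix subtrees that this trie
duplicates.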

BTW: There is no reason to use an Aho-Corasick automaton for this
problem since you don't need its
find-all-matches-in-a-given-sequence-in-a-single-pass functionality
(and that functionality adds substantial complexity to automaton
generation).


