Efficient String Lookup?

Sun Oct 17 06:11:11 EDT 2004

Chris S. wrote:
> Spoke too soon.

As did I.  :)

> However, I noticed rex still doesn't return multiple matches. For 
> instance, matching 'abc' to the given the patterns '#bc', 'a#c', and 
> 'ab#', your code only returns a match to the first pattern '#bc'. Is 
> this standard behavior or is it possible to change this?

This is standard behavior.  You can't change it.  The
only easy solution along these lines is to have a triangular
table of

   (pat1)|(pat2)| .... |(patN)
   (pat2)| .... |(patN)
   ...
   (patN)

and if group i matched at a point x then do another
search using the (i+1)th entry in that table at that
point.  Repeat until no more matches at x.

I don't know of any off-the-shelf solution for Python
for what you want to do, other than the usual "try
each pattern individually."  You'll need to make some
sort of state table (or trie in your case) and do it
that way.

You *can* use Perl's regexps for this sort of thing.  That
regexp language allowed embedded Perl code, so this will
get you an answer

% perl -ne 'while (/((.bc)(?{print "Match 1 at ", length($`), 
"\n"})^)|((a.c)(?{print "Match 2 at ", length($`), "\n"})^)|./g){}'
This is abc acbcb
Match 1 at 8
Match 2 at 8
Match 1 at 13

Breaking the pattern down I'm using "while (/ ... /g) {}" to
match everything in the input string ($_), which comes from
each line of stdin (because of the '-n' command-line flag).

The pattern is

  ((.bc)(?{print "Match 1 at ", length($`), "\n"})^)
|((a.c)(?{print "Match 2 at ", length($`), "\n"})^)
|.

That is, match ".bc" then execute the corresponding piece
of embedded Perl code.  This prints "Match 1 at " followed
by the length of the text before the current match, which
corresponds to the position of the match.

(If you only need that there is a match, you don't need
then.  Using $` in perl gives a performance hit.)

After it executes, the subgroup matches (embedded executable
code always passes, and it consumes no characters).  But
then it gets to the '^' test which fails because this is
never at the beginning of the string.

So the regexp engine tries the next option, which is the
"Match 2 at .." test and print.  After the print (if
indeed there is a match 2) it also fails.

This takes the engine to the last option which is the "."
character.  And that always passes.

Hence this pattern always consumes one and only one character.
I could put it inside a (...)* to match all characters,
but decided instead to use the while(/.../g){} to do the looping.
Why?  Old practices and not well-determined reason.

(The while loop works because of the 'g' flag on the pattern.)

You talk about needing to eek all the performance you can
out of the system.  Have you tried the brute force approach
of just doing N regexp tests?

If you need the performance, it's rather easy to convert
a simple trie into C code, save the result on the fly
to a file, compile that into a Python shared library, and
import that library, to get a function that does the
tests given a string.  Remember to give a new name to
each shared library as otherwise the importing gets confused.

				Andrew
				dalke at dalkescientific.com