Seeking regex optimizer

andrewdalke at gmail.com
Tue Jun 20 07:22:44 EDT 2006


Kay Schluehr replied to my question:
> > Why do you want to use a regex for this?
>
> Because it is part of a tokenizer that already uses regexps and I do
> not intend to rewrite / replace it.

Switching to pytst is not a big change - there will be little
impact on the rest of your code.  On the other hand, it does
change your build and deployment process because of the
need for another package.

Because you are constraining yourself to regexes, there isn't
much improvement you can make.

> I found that certain groups of token
> might be represented in a more compact manner: for matching ['+',
> '++'] one might generate '\+|\+\+' or '\+\+?' and I wanted to know if
> there is some generic approach to solve the "inverse regexp" problem in
> a non-trivial fashion.

There are a few questions which come up.  Why do you want a
"more compact" representation?  Do you think it will make the
result faster?  It likely will, because of the reduced need for
backtracking.  But unless you have a lot of strings with the
same prefix, the number of backtracks should grow as something
like the log of the number of words, which scales well enough
that the gain is small.
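On the quoted '+' example specifically, it's worth remembering that
Python's re alternation is leftmost-first, not longest-match, so the
order of the branches changes what gets matched.  A quick sketch:

```python
import re

# Python's re tries alternatives left to right and takes the first
# one that matches, so '\+|\+\+' never matches the two-character token:
assert re.match(r'\+|\+\+', '++').group() == '+'

# Putting the longer branch first, or factoring the shared prefix,
# gives the longest-token behavior a tokenizer usually wants:
assert re.match(r'\+\+|\+', '++').group() == '++'
assert re.match(r'\+\+?', '++').group() == '++'
```

So '\+\+?' isn't just more compact than '\+|\+\+'; for this engine it
is also the one that behaves correctly on '++'.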

Are you speed-limited by the existing implementation?  Likely
not, in which case performance isn't an issue.  Consider
this anecdote from http://www.technicat.com/writing/programming.html

    I was asked in a phone interview with Google how I would
    search for a number in an array of ordered numbers.
    Obviously, the questioner was asking for a CS 101 answer -
    binary search. But in real life, I would probably do the "wrong"
    thing - search the array from beginning to end. There's no point
    in taking twice the time to write twice as much code that has
    to be maintained and debugged if the application performance
    is good enough, and in particular if that piece of code is not
    the bottleneck in the application. (And I seriously doubt you'd
    have that data laid out linearly in a fixed array like that if it
    was the bottleneck.)

As it turns out, the regexp implementation could do the optimization
you want.  That is, it could recognize that the input sequence is
a set of fixed words (no special characters) and implement something
like Aho-Corasick, in which case you wouldn't need to change
your input.

  I don't know if Python's regexp implementation does this already.
Neither do you.  Test the functionality and performance first
before starting other work.
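A made-up benchmark along those lines (the word list, sizes, and
repeat count here are all arbitrary choices of mine, not from the
original problem):

```python
import random
import re
import string
import timeit

# Build ~1000 distinct 5-letter "tokens" and a text made of them,
# then time how long re takes to scan the text with one big
# alternation of literal words.
words = list(set(''.join(random.choice(string.ascii_lowercase)
                         for _ in range(5))
                 for _ in range(1000)))
pattern = re.compile('|'.join(map(re.escape, words)))
text = ' '.join(random.choice(words) for _ in range(10000))

seconds = timeit.timeit(lambda: pattern.findall(text), number=10)
print('10 scans of a 10000-word text: %.3f s' % seconds)
```

If a number like that is already fast enough for your tokenizer,
the optimization question answers itself.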

I am not the first to give you this advice:

John Machin:
> Optimised with respect to what? speed? ease of construction?

gry at ll.mit.edu:
> As others have suggested, you should first try the most naive
> implementation before making a hard optimization problem out of this.


  As for the general question of the existence of a "generic approach
to solve the 'inverse regexp' problem in a non-trivial fashion":

  Yes, there is.  Kind of.  All (formal/theoretical) regular
expressions can be matched without any need for backtracking.
If you look up the theory of regular expressions, backtracking occurs
when there are ambiguities in traversing the state machine.  These
are called nondeterministic finite state machines, or NFAs.  It turns
out that every NFA can be written as a DFA, which has no need for
backtracking.
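The standard NFA-to-DFA trick is the subset construction: each DFA
state is a set of NFA states.  Here is a minimal sketch for the
language (a|b)*abb; the NFA encoding and all the names are my own
invention, not any library's API:

```python
# NFA for (a|b)*abb: state -> {symbol: set of next states}.
# State 0 loops on a and b; a,b,b in a row reaches accepting state 3.
nfa = {
    0: {'a': {0, 1}, 'b': {0}},
    1: {'b': {2}},
    2: {'b': {3}},
    3: {},
}
start, accept = frozenset({0}), 3

def move(states, sym):
    """All NFA states reachable from `states` on one `sym`."""
    out = set()
    for s in states:
        out |= nfa[s].get(sym, set())
    return frozenset(out)

# Subset construction: explore sets of NFA states breadth-first.
dfa, todo = {}, [start]
while todo:
    cur = todo.pop()
    if cur in dfa:
        continue
    dfa[cur] = {}
    for sym in 'ab':
        nxt = move(cur, sym)
        if nxt:
            dfa[cur][sym] = nxt
            todo.append(nxt)

def matches(text):
    """Run the DFA: one transition per character, no backtracking."""
    cur = start
    for ch in text:
        cur = dfa.get(cur, {}).get(ch)
        if cur is None:
            return False
    return accept in cur

assert matches('aabb')
assert matches('ababb')
assert not matches('abab')
```

Note the matcher never backs up: each input character is consumed
exactly once, which is the whole point of the DFA form.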

  Except.  Actually, two excepts.  The normal algorithm for converting
an NFA to a DFA doesn't support grouping, which you'll need in order
to figure out which group matched.  Wikipedia says there is an
alternate, slower algorithm which does support groups.

  The second except is that regexps in Python, Perl, etc. are not
regular expressions.  Some patterns, like r"([A-Z][a-z]+)\1", which
matches "WikiWiki", are not "regular grammars" under the definition
used in language theory.  It's possible to use regexps to match
strings of prime length, for example.
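The prime-length trick is the classic demonstration that
backreferences go beyond regular grammars.  A string of n x's has
composite length exactly when it splits into two or more equal
repeats of some block of length at least two:

```python
import re

# (xx+)\1+ : a block of >= 2 x's, repeated >= 2 times in total.
# It matches 'x' * n exactly when n is composite (n >= 4).
composite = re.compile(r'^(xx+)\1+$')

def is_prime(n):
    """Primality test by (abuse of) backtracking regexps."""
    return n > 1 and not composite.match('x' * n)

assert [n for n in range(2, 20) if is_prime(n)] == \
       [2, 3, 5, 7, 11, 13, 17, 19]
```

No true regular expression can do this, since the set of prime-length
strings is not a regular language.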

  So in the general case there is no solution.  There are many
possible optimizations, but no general solution.

  For your problem, which is a limited subset of the general one
since it involves only constant strings, the solution is a prefix
tree, also called a trie.  See http://en.wikipedia.org/wiki/Prefix_tree
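A trie-based longest-match lookup is only a few lines of Python.
This is a sketch of the idea, not code from any particular library,
and the operator list is just an example:

```python
def build_trie(words):
    """Nested-dict trie; the key '' marks a complete token."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[''] = w
    return root

def longest_match(trie, text, pos=0):
    """Longest token in `trie` starting at text[pos], or None."""
    node, best = trie, None
    for ch in text[pos:]:
        if ch not in node:
            break
        node = node[ch]
        if '' in node:
            best = node['']
    return best

ops = build_trie(['+', '++', '+=', '-', '->'])
assert longest_match(ops, '++x') == '++'
assert longest_match(ops, '+= 1') == '+='
assert longest_match(ops, '->y') == '->'
assert longest_match(ops, '* 2') is None
```

Lookup cost is bounded by the length of the longest token, no matter
how many tokens are in the table.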

  John Machin in an earlier post suggested you search along these
lines when he said
> I would have thought the way to approach this would be a simple
> character-by-character tree/trie-driven lookup. This would be worst case
> O(n) where n is the length of the longest string in your list. There may
> well be a Python-callable gadget for this on the net somewhere. Google
> "Danny Yoo ahocorasick" for a Python-callable solution to a similar but
> more complex problem.

For that matter I suggested it when I pointed you to pytst.  The tst
means "ternary search tree" (some write it "ternary search trie") which
is a different implementation of the same idea.

Your posts read like you ignored those pointers because they didn't
fall into the category "implemented with a regular expression".  If
that's the case then no, there is no solution.  You're letting a sense
of purity overwhelm practicality.

If you are really interested in this problem, try taking a CS
course titled something like "Automata Theory" or "Foundations
of Computer Science".  The one which covers DFAs, NFAs,
CFGs, Turing machines, etc.  It's the CS class I enjoyed the most.

                                Andrew
                                dalke at dalkescientific.com



