Regex Speed

John Machin sjmachin at lexicon.net
Wed Feb 21 04:10:29 EST 2007


On Feb 21, 12:14 pm, Pop User <popu... at christest2.dc.k12us.com> wrote:
> garri... at gmail.com wrote:
> > While creating a log parser for fairly large logs, we have run into an
> > issue where the time to process was relatively unacceptable (upwards
> > of 5 minutes for 1-2 million lines of logs). In contrast, using the
> > Linux tool grep would complete the same search in a matter of seconds.
>
> It's very hard to beat grep, depending on the nature of the regex you
> are searching with. The regex engines in python/perl/php/ruby have
> traded the raw speed of grep/awk for the ability to do more complex
> searches.
>
> http://swtch.com/~rsc/regexp/regexp1.html
>
> This might not be your problem but if it is you can always popen grep.
>
> It would be nice if there were a Thompson NFA re module.

Or a Glushkov NFA simulated by bit parallelism re module ... see
http://citeseer.ist.psu.edu/551772.html
(which Russ Cox (author of the paper you cited) seems not to have
read).
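For the curious, the core of the bit-parallel idea is tiny. Here's a
minimal sketch (mine, not Navarro's actual code) of classic Shift-And,
which simulates the Glushkov NFA of a plain literal pattern in a single
machine word, doing O(1) bit operations per text byte as long as the
pattern fits in a word:

```python
def shift_and(pattern, text):
    # B[c] has bit i set iff pattern[i] == c.
    m = len(pattern)
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)

    D = 0                    # bit i set iff pattern[0..i] matches a suffix
    accept = 1 << (m - 1)    # of the text read so far
    hits = []
    for pos, c in enumerate(text):
        # Advance every live NFA state by one character, and
        # (re)start a match at state 0, all in one bit expression.
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & accept:
            hits.append(pos - m + 1)
    return hits

print(shift_and("abc", "xabcabc"))  # -> [1, 4]
```

Extending this to character classes and the ?/+/* "extended patterns"
is what the Navarro/Raffinot line of work is about.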

Cox uses a "pathological regex" (regex = "a?" * 29 + "a" * 29, in
Python code) to make his point: grep uses a Thompson gadget and takes
linear time, while Python, Perl and friends use backtracking and go
off the planet.
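You can watch the blow-up from Python directly; this little timing
loop (with n kept well below Cox's 29 so it actually finishes) grows
roughly exponentially in n:

```python
import re
import time

# Cox's pathological case: "a?" * n + "a" * n matched against "a" * n.
# A backtracking engine must try on the order of 2**n ways to
# distribute the a's between the optional and mandatory parts.
for n in (5, 10, 15, 18):
    pattern = re.compile("a?" * n + "a" * n)
    text = "a" * n
    t0 = time.perf_counter()
    assert pattern.match(text)
    print(n, time.perf_counter() - t0)
```

At n = 29 the backtracker would need on the order of 2**29 attempts;
grep's Thompson simulation stays linear in the text length.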

The interesting thing is that in Navarro's NR-grep, that's not
pathological at all; it's a simple case of an "extended pattern" (the
?, + and * operators applied to a single character or character
class) -- it takes linear time with a much easier setup than a full
NFA/DFA and not much code executed per byte scanned.
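To see why such patterns are cheap, here's a toy linear-time matcher
(an illustration of the idea only, not NR-grep's bit-parallel
implementation) for patterns built from single characters each
optionally followed by '?'. It just tracks the set of live NFA states,
so the work per text byte is bounded by the pattern length -- no
backtracking, ever:

```python
def parse(pattern):
    # "a?b?c" -> [('a', True), ('b', True), ('c', False)]
    items = []
    i = 0
    while i < len(pattern):
        opt = i + 1 < len(pattern) and pattern[i + 1] == '?'
        items.append((pattern[i], opt))
        i += 2 if opt else 1
    return items

def match(pattern, text):
    items = parse(pattern)
    n = len(items)

    def closure(states):
        # epsilon-closure: from state i we may skip optional items
        out = set()
        for s in states:
            out.add(s)
            j = s
            while j < n and items[j][1]:
                j += 1
                out.add(j)
        return out

    states = closure({0})
    for ch in text:
        states = closure({s + 1 for s in states
                          if s < n and items[s][0] == ch})
    return n in states   # full match: did we consume the whole pattern?

print(match("a?" * 3 + "a" * 3, "aaa"))  # -> True
```

With the state set packed into a machine word, the closure and the
transition each collapse to a couple of bit operations per byte, which
is essentially what NR-grep does.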

Getting back to the "It would be nice ..." bit: yes, it would be nice
to have even more smarts in re, but who's going to do it? It's not a
"rainy Sunday afternoon" job :-)

Cheers,
John




More information about the Python-list mailing list