re.search much slower then grep on some regular expressions

Wed Jul 9 22:20:41 EDT 2008

On Jul 9, 10:06 pm, Kris Kennaway <k... at FreeBSD.org> wrote:
> John Machin wrote:
> >> Hmm, unfortunately it's still orders of magnitude slower than grep in my
> >> own application that involves matching lots of strings and regexps
> >> against large files (I killed it after 400 seconds, compared to 1.5 for
> >> grep), and that's leaving aside the much longer compilation time (over a
> >> minute).  If the matching was fast then I could possibly pickle the
> >> lexer though (but it's not).
>
> > Can you give us some examples of the kinds of patterns that you are
> > using in practice and are slow using Python re?
>
> Trivial stuff like:
>
>            (Str('error in pkg_delete'), ('mtree', 'mtree')),
>            (Str('filesystem was touched prior to .make install'),
> ('mtree', 'mtree')),
>            (Str('list of extra files and directories'), ('mtree', 'mtree')),
>            (Str('list of files present before this port was installed'),
> ('mtree', 'mtree')),
>            (Str('list of filesystem changes from before and after'),
> ('mtree', 'mtree')),
>
>            (re('Configuration .* not supported'), ('arch', 'arch')),
>
>            (re('(configure: error:|Script.*configure.*failed
> unexpectedly|script.*failed: here are the contents of)'),
>             ('configure_error', 'configure')),
> ...
>
> There are about 150 of them and I want to find which is the first match
> in a text file that ranges from a few KB up to 512MB in size.
>
>  > How large is "large"?
>
> > What kind of text?
>
> It's compiler/build output.
>
> > Instead of grep, you might like to try nrgrep ... google("nrgrep
> > Navarro Raffinot"): PDF paper about it on Citeseer (if it's up),
> > postscript paper and C source findable from Gonzalo Navarro's home-
> > page.
>
> Thanks, looks interesting but I don't think it is the best fit here.  I
> would like to avoid spawning hundreds of processes to process each file
> (since I have tens of thousands of them to process).
>

Uh-huh ... try this, then:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

You could use this to find the "Str" cases and the prefixes of the
"re" cases (which seem to be no more complicated than 'foo.*bar.*zot')
and use something slower like Python's re to search the remainder of
the line for 'bar.*zot'.

Cheers,
John