re.search much slower then grep on some regular expressions

Carl Banks pavlovevidence at gmail.com
Sat Jul 5 08:54:48 EDT 2008


On Jul 5, 6:44 am, "Sebastian \"lunar\" Wiesner"
<basti.wies... at gmx.net> wrote:
> Carl Banks <pavlovevide... at gmail.com>:
>
>
>
> > On Jul 5, 4:12 am, "Sebastian \"lunar\" Wiesner"
> > <basti.wies... at gmx.net> wrote:
> >> Paddy <paddy3... at googlemail.com>:
>
> >> > On Jul 4, 1:36 pm, Peter Otten <__pete... at web.de> wrote:
> >> >> Henning_Thornblad wrote:
> >> >> > What can be the cause of the large difference between re.search and
> >> >> > grep?
>
> >> >> grep uses a smarter algorithm ;)
>
> >> >> > This script takes about 5 min to run on my computer:
> >> >> > #!/usr/bin/env python
> >> >> > import re
>
> >> >> > row=""
> >> >> > for a in range(156000):
> >> >> > row+="a"
> >> >> > print re.search('[^ "=]*/',row)
>
> >> >> > While doing a simple grep:
> >> >> > grep '[^ "=]*/' input                  (input contains 156.000 a in
> >> >> > one row)
> >> >> > doesn't even take a second.
>
> >> >> > Is this a bug in python?
>
> >> >> You could call this a performance bug, but it's not common enough in
> >> >> real code to get the necessary brain cycles from the core developers.
> >> >> So you can either write a patch yourself or use a workaround.
>
> >> >> re.search('[^ "=]*/', row) if "/" in row else None
>
> >> >> might be good enough.
>
> >> >> Peter
>
> >> > It is not a smarter algorithm that is used in grep. Python RE's have
> >> > more capabilities than grep RE's which need a slower, more complex
> >> > algorithm.
>
> >> FWIW, grep itself can confirm this statement.  The following command
> >> roughly takes as long as Python's re.search:
>
> >> # grep -P '[^ "=]*/' input
>
> >> -P tells grep to use real perl-compatible regular expressions.
>
> > This confirms that a particular engine might not be optimized for it,
> > but it's not necessarily a reflection that the engine is more complex.
>
> My posting wasn't intended to reflect the differences in complexity between
> normal GNU grep expressions (which are basically extended POSIX
> expressions) and perl-compatible expressions.  The latter just _are_ more
> complex, having additional features like look aheads or non-greedy
> qualifiers.
>
> I just wanted to illustrate, that the speed of the given search is somehow
> related to the complexity of the engine.

I don't think you've illustrated that at all.  What you've illustrated
is that one implementation of regexp optimizes something that another
doesn't.  It might be due to differences in complexity; it might not.
(Maybe there's something about PCREs that precludes the optimization
that the default grep uses, but I'd be inclined to think not.)


Carl Banks



More information about the Python-list mailing list