re.search much slower then grep on some regular expressions

Mon Jul 7 20:48:15 EDT 2008

On Jul 8, 2:51 am, Henning Thornblad <Henning.Thornb... at gmail.com>
wrote:
> When trying to find an alternative way of solving my problem i found
> that running this script:
>
> #!/usr/bin/env python
>
> import re
>
> row=""
> for a in range(156000):
>     row+="a"
> print "How many, dude?"
> print re.search('/[^ "=]*',row)  (the / has moved)
>
> wouldn't take even a second (The re.search part of course)
>
> This is for me very strange since this,
> in theory, is the same problem as before.

In theory. In practice, it is the same problem *iff* you start at the
end of the text and work backwards.

Your original pattern can be characterised as X*Y, where X and Y are
character classes; your new one is YX*. You are asking each to search
a text that contains many Xs and no Ys.

The second pattern will cause a naive search engine to examine each
character in the text (is it a Y?) as a candidate for the start of a
successful match; each test fails, and the whole expedition has cost
O(N).

OTOH, the first pattern will start at offset 0, cheerfully cruising
past all those lovely Xs, but failing (no Y) at the end of the text.
It will then think that it's been too greedy, reduce the matched span
of X* from N characters to N-1, fail to find Y, N-2, fail, ... so
that's O(N) just for the first trial starting from offset 0. Being
naive, it will try again starting from offset 1, 2, 3, ... an O(N**2)
process. If Y were instead something more complicated, it would be
worse than O(N**2).

If you were to tell us what you are actually trying to do, we could
probably help you with a faster method.

BTW, if the text is 'aaa///bbb///ccc', your original search will find
'aaa///bbb///'; if you want it to find 'aaa/', the pattern should be:
     '[^ "=/]*/'

Cheers,
John