Using regexes versus "in" membership test?

Wed Dec 12 20:09:33 EST 2012

On Wed, 12 Dec 2012 14:35:41 -0800, Victor Hooi wrote:

> Hi,
> 
> I have a script that trawls through log files looking for certain error
> conditions. These are identified via certain keywords (all different) in
> those lines
> 
> I then process those lines using regex groups to extract certain fields.
[...]
> Also, my regexs could possibly be tuned, they look something like this:
> 
>     (?P<timestamp>\d{2}:\d{2}:\d{2}.\d{9})\s*\[(?P<logginglevel>\w+)\s*
\]\s*\[(?P<module>\w+)\s*\]\s*\[{0,1}\]{0,1}\s*\[(?P<function>\w+)\s*\]
\s*level\(\d\) broadcast\s*\(\[(?P<f1_instance>\w+)\]\s*\[(?P<foo>\w+)\]
\s*(?P<bar>\w{4}):(?P<feedcode>\w+) failed order: (?P<side>\w+) (?
P<volume>\d+) @ (?P<price>[\d.]+), error on update \(\d+ : Some error 
string. Active Orders=(?P<value>\d+) Limit=(?P<limit>\d+)\)\)
>
> (Feel free to suggest any tuning, if you think they need it).

"Tuning"? I think it needs to be taken out and killed with a stake to the 
heart, then buried in concrete! :-)

An appropriate quote:

    Some people, when confronted with a problem, think "I know, 
    I'll use regular expressions." Now they have two problems.
    -- Jamie Zawinski

Is this actually meant to be a single regex, or did your email somehow 
mangle multiple regexes into a single line?

At the very least, you should write your regexes using the VERBOSE flag, 
so you can use non-significant whitespace and comments. There is no 
performance cost to using VERBOSE once they are compiled, but a huge 
maintainability benefit.

> My question is - I've heard that using the "in" membership operator is
> substantially faster than using Python regexes.
> 
> Is this true? What is the technical explanation for this? And what sort
> of performance characteristics are there between the two?

Yes, it is true. The technical explanation is simple:

* the "in" operator implements simple substring matching, 
  which is trivial to perform and fast;

* regexes are an interpreted mini-language which operate via
  a complex state machine that needs to do a lot of work,
  which is complicated to perform and slow.

Python's regex engine is not as finely tuned as (say) Perl's, but even in 
Perl simple substring matching ought to be faster, simply because you are 
doing less work to match a substring than to run a regex.

But the real advantage to using "in" is readability and maintainability.

As for the performance characteristics, you really need to do your own 
testing. Performance will depend on what you are searching for, where you 
are searching for it, whether it is found or not, your version of Python, 
your operating system, your hardware.

At some level of complexity, you are better off just using a regex rather 
than implementing your own buggy, complicated expression matcher: for 
some matching tasks, there is no reasonable substitute to regexes. But 
for *simple* uses, you should prefer *simple* code:

[steve at ando ~]$ python -m timeit \
> -s "data = 'abcd'*1000 + 'xyz' + 'abcd'*1000" \
> "'xyz' in data"
100000 loops, best of 3: 4.17 usec per loop

[steve at ando ~]$ python -m timeit \
> -s "data = 'abcd'*1000 + 'xyz' + 'abcd'*1000" \
> -s "from re import search" \
> "search('xyz', data)"
100000 loops, best of 3: 10.9 usec per loop

> (I couldn't find much in the way of docs for "in", just the brief
> mention here -
> http://docs.python.org/2/reference/expressions.html#not-in )
> 
> Would I be substantially better off using a list of strings and using
> "in" against each line, then using a second pass of regex only on the
> matched lines?

That's hard to say. It depends on whether you are matching on a substring 
that will frequently be found close to the start of each line, or 
something else.

Where I expect you *will* see a good benefit is:

* you have many lines to search;
* but only a few actually match the regex;
* the regex is quite complicated, and needs to backtrack a lot;
* but you can eliminate most of the "no match" cases with a simple
  substring match.

If you are in this situation, then very likely you will see a big benefit 
from a two-pass search:

for line in log:
    if any(substr in line for substr in list_of_substrings):
        # now test against a regex

Otherwise, maybe, maybe not.

-- 
Steven