Using regexes versus "in" membership test?

Thu Dec 13 01:10:23 EST 2012

Hi,

That was actually *one* regex...lol.

And yes, it probably is a bit convoluted.

Thanks for the tip about using VERBOSE - I'll use that and comment my regex.
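
For instance, a VERBOSE version of just the leading fields might look roughly like the sketch below. The group names are borrowed from the original regex, the sample line is made up for illustration, and the dot in the timestamp is escaped here on the assumption that a literal "." was intended.

```python
import re

# Sketch only: field names come from the original regex; the sample line
# below is invented for illustration and is not a real log entry.
HEADER = re.compile(r"""
    (?P<timestamp>\d{2}:\d{2}:\d{2}\.\d{9})  # HH:MM:SS.nnnnnnnnn (dot escaped)
    \s*\[(?P<logginglevel>\w+)\s*\]          # e.g. [Info  ]
    \s*\[(?P<module>\w+)\s*\]                # e.g. [SOME_MODULE]
    \s*\[(?P<function>\w+)\s*\]              # e.g. [SOME_FUNCTION]
""", re.VERBOSE)

sample = "07:40:05.793627975 [Info  ] [SOME_MODULE] [SOME_FUNCTION] [SOME_OTHER_FLAG]"
match = HEADER.match(sample)
fields = match.groupdict() if match else {}
print(fields)
```

With VERBOSE, whitespace in the pattern is ignored and "#" starts a comment, so each field can sit on its own documented line.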

Are there any other general pointers you might give for that regex? The lines I'm trying to match look something like this:

    07:40:05.793627975 [Info  ] [SOME_MODULE] [SOME_FUNCTION] [SOME_OTHER_FLAG] [RequestTag=0 ErrorCode=3 ErrorText="some error message" ID=0:0x0000000000000000 Foo=1 Bar=5 Joe=5]

Essentially, I'd want to extract the timestamp, logging level, module, function etc. - and possibly the tag-value pairs too.
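
One possible way to pull out the tag-value pairs is a second, much smaller regex applied only to the last bracketed group. This is a sketch under assumptions: the pairs always sit in the final [...] group, values are either double-quoted strings or runs of non-space characters, and no quoted value contains a "[". The names PAIR and parse_pairs are hypothetical.

```python
import re

# Sketch only: assumes the tag-value pairs sit in the last [...] group and
# that values are either double-quoted strings or runs of non-space characters.
PAIR = re.compile(r'(?P<key>\w+)=(?:"(?P<quoted>[^"]*)"|(?P<bare>[^\s\]]+))')

def parse_pairs(line):
    """Return the key=value pairs from the final bracketed group as a dict."""
    start = line.rfind("[")
    if start == -1:
        return {}
    pairs = {}
    # Scan from the last "[" onwards so earlier fields are ignored.
    for m in PAIR.finditer(line, start):
        value = m.group("quoted")
        if value is None:
            value = m.group("bare")
        pairs[m.group("key")] = value
    return pairs

line = ('07:40:05.793627975 [Info  ] [SOME_MODULE] [SOME_FUNCTION] '
        '[SOME_OTHER_FLAG] [RequestTag=0 ErrorCode=3 '
        'ErrorText="some error message" ID=0:0x0000000000000000 '
        'Foo=1 Bar=5 Joe=5]')
result = parse_pairs(line)
print(result)
```

All values come back as strings, so anything numeric would still need converting afterwards.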

And yes, based on what you said, I'll probably run the "in" check first and only apply the regex to lines that pass it - the lines I'm searching for are fairly few compared to the overall log size.
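
A minimal sketch of that two-pass approach, assuming a placeholder keyword list and a placeholder pattern (KEYWORDS, PATTERN and find_errors are hypothetical names, not the real ones from the script):

```python
import re

# Sketch only: KEYWORDS and PATTERN are placeholders, not the real ones.
KEYWORDS = ["ErrorCode=", "failed order"]
PATTERN = re.compile(r"ErrorCode=(?P<code>\d+)")

def find_errors(lines):
    """Yield regex matches, but only for lines that pass the cheap 'in' test."""
    for line in lines:
        # First pass: plain substring checks weed out most lines.
        if any(keyword in line for keyword in KEYWORDS):
            # Second pass: the regex runs only on the few candidate lines.
            match = PATTERN.search(line)
            if match:
                yield match

log = [
    "07:40:05.793627975 [Info  ] [SOME_MODULE] nothing to see here",
    "07:40:06.000000001 [Info  ] [SOME_MODULE] [RequestTag=0 ErrorCode=3]",
]
codes = [m.group("code") for m in find_errors(log)]
print(codes)
```

Whether this is actually faster on the real logs would still need measuring, for example with timeit.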

Cheers,
Victor

On Thursday, 13 December 2012 12:09:33 UTC+11, Steven D'Aprano  wrote:
> On Wed, 12 Dec 2012 14:35:41 -0800, Victor Hooi wrote:
>
> > Hi,
> >
> > I have a script that trawls through log files looking for certain error
> > conditions. These are identified via certain keywords (all different) in
> > those lines
> >
> > I then process those lines using regex groups to extract certain fields.
>
> [...]
> 
> > Also, my regexes could possibly be tuned, they look something like this:
> >
> >     (?P<timestamp>\d{2}:\d{2}:\d{2}.\d{9})\s*\[(?P<logginglevel>\w+)\s*\]\s*\[(?P<module>\w+)\s*\]\s*\[{0,1}\]{0,1}\s*\[(?P<function>\w+)\s*\]\s*level\(\d\) broadcast\s*\(\[(?P<f1_instance>\w+)\]\s*\[(?P<foo>\w+)\]\s*(?P<bar>\w{4}):(?P<feedcode>\w+) failed order: (?P<side>\w+) (?P<volume>\d+) @ (?P<price>[\d.]+), error on update \(\d+ : Some error string. Active Orders=(?P<value>\d+) Limit=(?P<limit>\d+)\)\)
> >
> > (Feel free to suggest any tuning, if you think they need it).
> 
> "Tuning"? I think it needs to be taken out and killed with a stake to the
> heart, then buried in concrete! :-)
>
> An appropriate quote:
>
>     Some people, when confronted with a problem, think "I know,
>     I'll use regular expressions." Now they have two problems.
>     -- Jamie Zawinski
>
> Is this actually meant to be a single regex, or did your email somehow
> mangle multiple regexes into a single line?
>
> At the very least, you should write your regexes using the VERBOSE flag,
> so you can use non-significant whitespace and comments. There is no
> performance cost to using VERBOSE once they are compiled, but a huge
> maintainability benefit.
> 
> > My question is - I've heard that using the "in" membership operator is
> > substantially faster than using Python regexes.
> >
> > Is this true? What is the technical explanation for this? And what sort
> > of performance characteristics are there between the two?
>
> Yes, it is true. The technical explanation is simple:
>
> * the "in" operator implements simple substring matching,
>   which is trivial to perform and fast;
>
> * regexes are an interpreted mini-language which operate via
>   a complex state machine that needs to do a lot of work,
>   which is complicated to perform and slow.
>
> Python's regex engine is not as finely tuned as (say) Perl's, but even in
> Perl simple substring matching ought to be faster, simply because you are
> doing less work to match a substring than to run a regex.
>
> But the real advantage to using "in" is readability and maintainability.
>
> As for the performance characteristics, you really need to do your own
> testing. Performance will depend on what you are searching for, where you
> are searching for it, whether it is found or not, your version of Python,
> your operating system, your hardware.
>
> At some level of complexity, you are better off just using a regex rather
> than implementing your own buggy, complicated expression matcher: for
> some matching tasks, there is no reasonable substitute to regexes. But
> for *simple* uses, you should prefer *simple* code:
> 
> [steve@ando ~]$ python -m timeit \
> > -s "data = 'abcd'*1000 + 'xyz' + 'abcd'*1000" \
> > "'xyz' in data"
> 100000 loops, best of 3: 4.17 usec per loop
>
> [steve@ando ~]$ python -m timeit \
> > -s "data = 'abcd'*1000 + 'xyz' + 'abcd'*1000" \
> > -s "from re import search" \
> > "search('xyz', data)"
> 100000 loops, best of 3: 10.9 usec per loop
> 
> > (I couldn't find much in the way of docs for "in", just the brief
> > mention here -
> > http://docs.python.org/2/reference/expressions.html#not-in )
> >
> > Would I be substantially better off using a list of strings and using
> > "in" against each line, then using a second pass of regex only on the
> > matched lines?
>
> That's hard to say. It depends on whether you are matching on a substring
> that will frequently be found close to the start of each line, or
> something else.
>
> Where I expect you *will* see a good benefit is:
>
> * you have many lines to search;
> * but only a few actually match the regex;
> * the regex is quite complicated, and needs to backtrack a lot;
> * but you can eliminate most of the "no match" cases with a simple
>   substring match.
>
> If you are in this situation, then very likely you will see a big benefit
> from a two-pass search:
> 
> for line in log:
>     if any(substr in line for substr in list_of_substrings):
>         # now test against a regex (assumed precompiled as `pattern`)
>         match = pattern.search(line)
> 
> Otherwise, maybe, maybe not.
>
> -- 
> Steven