Using regexes versus "in" membership test?
Victor Hooi
victorhooi at gmail.com
Thu Dec 13 01:10:23 EST 2012
Hi,
That was actually *one* regex...lol.
And yes, it probably is a bit convoluted.
Thanks for the tip about using VERBOSE - I'll use that and comment my regex.
Are there any other general pointers you might give for that regex? The lines I'm trying to match look something like this:
07:40:05.793627975 [Info ] [SOME_MODULE] [SOME_FUNCTION] [SOME_OTHER_FLAG] [RequestTag=0 ErrorCode=3 ErrorText="some error message" ID=0:0x0000000000000000 Foo=1 Bar=5 Joe=5]
Essentially, I want to pull out the timestamp, logging level, module, function etc. - and possibly the tag-value pairs as well.
And yes, based on what you said, I'll probably run the "in" test first, before the regex - the lines I'm searching for are fairly few compared to the overall log size.
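To make the question concrete, here is a minimal sketch of how the sample line above might be pulled apart with a commented VERBOSE regex. It assumes every line has exactly the layout of that one sample; the group names (level, flag, pairs) and the helper parse_line are made up for illustration, and the timestamp dot is escaped here.

```python
import re

# Sketch only: the layout is inferred from the single sample line above.
LINE_RE = re.compile(r"""
    (?P<timestamp>\d{2}:\d{2}:\d{2}\.\d{9}) \s*   # 07:40:05.793627975
    \[ (?P<level>\w+) \s* \] \s*                  # [Info ]
    \[ (?P<module>\w+) \] \s*                     # [SOME_MODULE]
    \[ (?P<function>\w+) \] \s*                   # [SOME_FUNCTION]
    \[ (?P<flag>\w+) \] \s*                       # [SOME_OTHER_FLAG]
    \[ (?P<pairs>.*) \]                           # [RequestTag=0 ... Joe=5]
    """, re.VERBOSE)

# A tag, an equals sign, then either a double-quoted string or a run of
# non-whitespace characters.
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

SAMPLE = ('07:40:05.793627975 [Info ] [SOME_MODULE] [SOME_FUNCTION] '
          '[SOME_OTHER_FLAG] [RequestTag=0 ErrorCode=3 '
          'ErrorText="some error message" ID=0:0x0000000000000000 '
          'Foo=1 Bar=5 Joe=5]')


def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Replace the raw tag-value text with a dict, dropping surrounding quotes.
    fields["pairs"] = {key: value.strip('"')
                       for key, value in PAIR_RE.findall(fields["pairs"])}
    return fields


if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

Splitting the tag-value pairs off into a second, much simpler regex keeps the main pattern short, and it doesn't care how many pairs a line carries.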
Cheers,
Victor
On Thursday, 13 December 2012 12:09:33 UTC+11, Steven D'Aprano wrote:
> On Wed, 12 Dec 2012 14:35:41 -0800, Victor Hooi wrote:
>
> > Hi,
> >
> > I have a script that trawls through log files looking for certain error
> > conditions. These are identified via certain keywords (all different) in
> > those lines
> >
> > I then process those lines using regex groups to extract certain fields.
> [...]
> > Also, my regexs could possibly be tuned, they look something like this:
> >
> > (?P<timestamp>\d{2}:\d{2}:\d{2}.\d{9})\s*\[(?P<logginglevel>\w+)\s*
> > \]\s*\[(?P<module>\w+)\s*\]\s*\[{0,1}\]{0,1}\s*\[(?P<function>\w+)\s*\]
> > \s*level\(\d\) broadcast\s*\(\[(?P<f1_instance>\w+)\]\s*\[(?P<foo>\w+)\]
> > \s*(?P<bar>\w{4}):(?P<feedcode>\w+) failed order: (?P<side>\w+) (?
> > P<volume>\d+) @ (?P<price>[\d.]+), error on update \(\d+ : Some error
> > string. Active Orders=(?P<value>\d+) Limit=(?P<limit>\d+)\)\)
> >
> > (Feel free to suggest any tuning, if you think they need it).
>
> "Tuning"? I think it needs to be taken out and killed with a stake to the
> heart, then buried in concrete! :-)
>
> An appropriate quote:
>
>     Some people, when confronted with a problem, think "I know,
>     I'll use regular expressions." Now they have two problems.
>     -- Jamie Zawinski
>
> Is this actually meant to be a single regex, or did your email somehow
> mangle multiple regexes into a single line?
>
> At the very least, you should write your regexes using the VERBOSE flag,
> so you can use non-significant whitespace and comments. There is no
> performance cost to using VERBOSE once they are compiled, but a huge
> maintainability benefit.
>
> > My question is - I've heard that using the "in" membership operator is
> > substantially faster than using Python regexes.
> >
> > Is this true? What is the technical explanation for this? And what sort
> > of performance characteristics are there between the two?
>
> Yes, it is true. The technical explanation is simple:
>
> * the "in" operator implements simple substring matching,
>   which is trivial to perform and fast;
>
> * regexes are an interpreted mini-language which operate via
>   a complex state machine that needs to do a lot of work,
>   which is complicated to perform and slow.
>
> Python's regex engine is not as finely tuned as (say) Perl's, but even in
> Perl simple substring matching ought to be faster, simply because you are
> doing less work to match a substring than to run a regex.
>
> But the real advantage to using "in" is readability and maintainability.
>
> As for the performance characteristics, you really need to do your own
> testing. Performance will depend on what you are searching for, where you
> are searching for it, whether it is found or not, your version of Python,
> your operating system, your hardware.
>
> At some level of complexity, you are better off just using a regex rather
> than implementing your own buggy, complicated expression matcher: for
> some matching tasks, there is no reasonable substitute to regexes. But
> for *simple* uses, you should prefer *simple* code:
>
> [steve@ando ~]$ python -m timeit \
> > -s "data = 'abcd'*1000 + 'xyz' + 'abcd'*1000" \
> > "'xyz' in data"
> 100000 loops, best of 3: 4.17 usec per loop
>
> [steve@ando ~]$ python -m timeit \
> > -s "data = 'abcd'*1000 + 'xyz' + 'abcd'*1000" \
> > -s "from re import search" \
> > "search('xyz', data)"
> 100000 loops, best of 3: 10.9 usec per loop
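For anyone who would rather repeat that comparison from inside a script than from the shell, here is a minimal sketch using the standard timeit module on the same test string. The function names (check_in, check_regex, best_usec) are made up for illustration, and the regex is precompiled, which differs slightly from the re.search call timed above. It prints whatever your own machine measures; no numbers are claimed here.

```python
import re
import timeit

# Same test data as the shell commands above.
data = "abcd" * 1000 + "xyz" + "abcd" * 1000
pattern = re.compile("xyz")


def check_in():
    """Plain substring test."""
    return "xyz" in data


def check_regex():
    """Equivalent test using a precompiled regex."""
    return pattern.search(data) is not None


def best_usec(func, number=10000, repeat=3):
    """Best-of-`repeat` time per call, in microseconds."""
    timings = timeit.repeat(func, number=number, repeat=repeat)
    return min(timings) / number * 1e6


if __name__ == "__main__":
    print("in:    %.2f usec per call" % best_usec(check_in))
    print("regex: %.2f usec per call" % best_usec(check_regex))
```

Taking the minimum of several repeats is the usual way to cut down on noise from whatever else the machine is doing.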
>
> > (I couldn't find much in the way of docs for "in", just the brief
> > mention here -
> > http://docs.python.org/2/reference/expressions.html#not-in )
> >
> > Would I be substantially better off using a list of strings and using
> > "in" against each line, then using a second pass of regex only on the
> > matched lines?
>
> That's hard to say. It depends on whether you are matching on a substring
> that will frequently be found close to the start of each line, or
> something else.
>
> Where I expect you *will* see a good benefit is:
>
> * you have many lines to search;
> * but only a few actually match the regex;
> * the regex is quite complicated, and needs to backtrack a lot;
> * but you can eliminate most of the "no match" cases with a simple
>   substring match.
>
> If you are in this situation, then very likely you will see a big benefit
> from a two-pass search:
>
> for line in log:
>     if any(substr in line for substr in list_of_substrings):
>         # now test against a regex
>
> Otherwise, maybe, maybe not.
>
> --
> Steven
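The two-pass search described above could be fleshed out roughly as follows. This is only a sketch: the keyword list, the ERROR_RE pattern and the function name find_errors are placeholders made up for illustration, not the real keywords or regexes from my script.

```python
import re

# Hypothetical keywords and pattern, for illustration only.
KEYWORDS = ["ErrorCode=", "failed order"]
ERROR_RE = re.compile(r"ErrorCode=(?P<code>\d+)")


def find_errors(lines, keywords=KEYWORDS, pattern=ERROR_RE):
    """Yield the regex groups for each line that passes both passes."""
    for line in lines:
        # First pass: cheap substring tests reject most lines.
        if not any(keyword in line for keyword in keywords):
            continue
        # Second pass: run the regex only on the surviving candidates.
        match = pattern.search(line)
        if match is not None:
            yield match.groupdict()


if __name__ == "__main__":
    log = [
        "07:40:05.793627975 [Info ] [SOME_MODULE] all good",
        "07:40:06.000000000 [Info ] [SOME_MODULE] [RequestTag=0 ErrorCode=3]",
        "07:40:07.000000000 [Info ] [SOME_MODULE] failed order, no code given",
    ]
    for fields in find_errors(log):
        print(fields)
```

Because it is a generator, it can be fed an open file object directly, so the whole log never has to sit in memory.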