Using regexes versus "in" membership test?

Wed Dec 12 17:35:41 EST 2012

Hi,

I have a script that trawls through log files looking for certain error conditions. Each condition is identified by a distinct keyword in the line.

I then process those lines using regex groups to extract certain fields.

Currently, I'm using a for loop to iterate through the file, and a dict of regexes:

    breaches = {
        'type1': re.compile(r'some_regex_expression'),
        'type2': re.compile(r'some_regex_expression'),
        'type3': re.compile(r'some_regex_expression'),
        'type4': re.compile(r'some_regex_expression'),
        'type5': re.compile(r'some_regex_expression'),
    }
    ...
    with open('blah.log', 'r') as f:
        for line in f:
            for breach in breaches:
                results = breaches[breach].search(line)
                if results:
                    self.logger.info('We found an error - {0} - {1}'.format(results.group('errorcode'), results.group('errormsg')))
                    # We do other things with other regex groups as well.

(This isn't the *exact* code, but it shows the logic/flow fairly closely).

For completeness, the actual regexes (which could possibly be tuned) look something like this:

    (?P<timestamp>\d{2}:\d{2}:\d{2}.\d{9})\s*\[(?P<logginglevel>\w+)\s*\]\s*\[(?P<module>\w+)\s*\]\s*\[{0,1}\]{0,1}\s*\[(?P<function>\w+)\s*\]\s*level\(\d\) broadcast\s*\(\[(?P<f1_instance>\w+)\]\s*\[(?P<foo>\w+)\]\s*(?P<bar>\w{4}):(?P<feedcode>\w+) failed order: (?P<side>\w+) (?P<volume>\d+) @ (?P<price>[\d.]+), error on update \(\d+ : Some error string. Active Orders=(?P<value>\d+) Limit=(?P<limit>\d+)\)\)

(Feel free to suggest any tuning, if you think they need it).

My question is - I've heard that using the "in" membership operator is substantially faster than using Python regexes.

Is this true? What is the technical explanation for this? And what sort of performance characteristics are there between the two?
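
For what it's worth, this is roughly how I'd try to measure the difference myself. It's only a sketch: the sample line and the keyword are made up, and the regex here is just the escaped keyword rather than one of my real patterns, so the numbers would only be a rough indication.

```python
import re
import timeit

# Made-up log line and keyword, purely for illustration.
line = '12:34:56.123456789 [INFO ] [mod ] [func ] nothing to see here ' * 3
keyword = 'failed order'

# A compiled regex that looks for the same literal text.
pattern = re.compile(re.escape(keyword))

n = 10000
t_in = timeit.timeit(lambda: keyword in line, number=n)
t_re = timeit.timeit(lambda: pattern.search(line), number=n)

print('in: {0:.4f}s  regex: {1:.4f}s'.format(t_in, t_re))
```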

(I couldn't find much in the way of docs for "in", just the brief mention here - http://docs.python.org/2/reference/expressions.html#not-in )

Would I be substantially better off using a list of strings and using "in" against each line, then using a second pass of regex only on the matched lines?
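
To make that concrete, here is a minimal sketch of the two-pass idea. The keywords, the patterns and the find_breaches name are placeholders I've invented for this example, not the ones from my real script.

```python
import re

# Hypothetical keyword -> regex mapping (placeholders, not my real patterns).
BREACHES = {
    'failed order': re.compile(
        r'failed order: (?P<side>\w+) (?P<volume>\d+) @ (?P<price>[\d.]+)'),
    'timeout': re.compile(r'timeout after (?P<seconds>\d+)s'),
}

def find_breaches(lines):
    """Yield (keyword, match) for lines that pass the 'in' test and the regex."""
    for line in lines:
        for keyword, pattern in BREACHES.items():
            if keyword in line:               # first pass: plain substring test
                match = pattern.search(line)  # second pass: regex on candidates only
                if match:
                    yield keyword, match

sample = [
    'nothing interesting here',
    'x failed order: BUY 100 @ 12.5, error',
    'timeout after 30s',
]
hits = [(k, m.groupdict()) for k, m in find_breaches(sample)]
```

The regex only runs on lines that contain the keyword, so most lines would cost just the substring checks.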

(The log files are compressed, and I'm actually using bz2 to read them in. Uncompressed size is around 40-50 GB.)
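
In case it matters, the reading side looks roughly like the sketch below. The file name and contents are invented so the snippet stands alone; the point is just that the file is streamed line by line rather than loaded whole, and that the lines come back as byte strings.

```python
import bz2
import os
import tempfile

# Build a tiny throwaway .bz2 file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sample.log.bz2')
with bz2.BZ2File(path, 'w') as f:
    f.write(b'all good\nfoo failed order: BUY 1 @ 2.0\nall good again\n')

# Stream the compressed file one line at a time, keeping only candidate lines.
matched = []
with bz2.BZ2File(path, 'r') as f:
    for line in f:
        if b'failed order' in line:   # byte-string keyword to match the bytes lines
            matched.append(line)
```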

Cheers,
Victor