[Tutor] Regular expression on python

Tue Apr 14 16:37:14 CEST 2015

Steven D'Aprano wrote:

> On Tue, Apr 14, 2015 at 10:00:47AM +0200, Peter Otten wrote:
>> Steven D'Aprano wrote:
> 
>> > I swear that Perl has been a blight on an entire generation of
>> > programmers. All they know is regular expressions, so they turn every
>> > data processing problem into a regular expression. Or at least they
>> > *try* to. As you have learned, regular expressions are hard to read,
>> > hard to write, and hard to get correct.
>> > 
>> > Let's write some Python code instead.
> [...]
> 
>> The tempter took possession of me and dictated:
>> 
>> >>> pprint.pprint(
>> ... [(k, int(v)) for k, v in
>> ... re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall(line)])
>> [('Input Read Pairs', 2127436),
>>  ('Both Surviving', 1795091),
>>  ('Forward Only Surviving', 17315),
>>  ('Reverse Only Surviving', 6413),
>>  ('Dropped', 308617)]
> 
> Nicely done :-)
> 
> I didn't say that it *couldn't* be done with a regex. 

I didn't claim that.

> Only that it is
> harder to read, write, etc. Regexes are good tools, but they aren't the
> only tool and as a beginner, which would you rather debug? The extract()
> function I wrote, or r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*" ?

I know a rhetorical question when I see one ;)

> Oh, and for the record, your solution is roughly 4-5 times faster than
> the extract() function on my computer. 

I wouldn't be bothered by that. See below if you are.

> If I knew the requirements were
> not likely to change (that is, the maintenance burden was likely to be
> low), I'd be quite happy to use your regex solution in production code,
> although I would probably want to write it out in verbose mode just in
> case the requirements did change:
> 
> 
> r"""(?x)    (?# verbose mode)
>     (.+?):  (?# capture one or more characters, followed by a colon)
>     \s+     (?# one or more whitespace)
>     (\d+)   (?# capture one or more digits)
>     (?:     (?# don't capture ... )
>       \s+       (?# one or more whitespace)
>       \(.*?\)   (?# anything inside round brackets)
>       )?        (?# ... and optional)
>     \s*     (?# ignore trailing spaces)
>     """
> 
> 
> That's a hint to people learning regular expressions: start in verbose
> mode, then "de-verbose" it if you must.
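
For anyone who wants to try the hint, here is a minimal sketch that compiles the same pattern with the re.VERBOSE flag and plain "#" comments instead of the inline (?x) and (?# ...) forms, then checks that it matches the sample line exactly like the compact pattern. The names compact, verbose, pairs_compact and pairs_verbose are mine, chosen only for illustration.

```python
import re

line = ("Input Read Pairs: 2127436 "
        "Both Surviving: 1795091 (84.38%) "
        "Forward Only Surviving: 17315 (0.81%) "
        "Reverse Only Surviving: 6413 (0.30%) "
        "Dropped: 308617 (14.51%)")

# The compact pattern from the thread.
compact = re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*")

# The same pattern in verbose mode: whitespace in the pattern is
# ignored and "#" starts a comment that runs to the end of the line.
verbose = re.compile(r"""
    (.+?):      # capture one or more characters, followed by a colon
    \s+         # one or more whitespace characters
    (\d+)       # capture one or more digits
    (?:         # don't capture ...
      \s+       #   one or more whitespace characters
      \(.*?\)   #   anything inside round brackets
    )?          # ... and the whole group is optional
    \s*         # ignore trailing whitespace
    """, re.VERBOSE)

pairs_compact = compact.findall(line)
pairs_verbose = verbose.findall(line)
print(pairs_verbose == pairs_compact)
```

Both patterns should yield the same list of (key, digits) string pairs, so the verbose form can be kept in the source without changing behaviour.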

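Regarding the speed of the Python approach: you can easily improve that with
relatively minor modifications. The most important one is to avoid the
exception. First, a check that all variants return the same result (the
script asserts this and prints nothing on success):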

$ python parse_jarod.py
$ python3 parse_jarod.py

The regex for reference:

$ python3 -m timeit -s "from parse_jarod import extract_re as extract" "extract()"
100000 loops, best of 3: 18.6 usec per loop
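
The same kind of measurement can also be made from inside Python with the timeit module. This is only a sketch: the repeat and number values are arbitrary choices of mine, and the figures will differ from machine to machine, so it prints no expected result.

```python
import re
import timeit

line = ("Input Read Pairs: 2127436 "
        "Both Surviving: 1795091 (84.38%) "
        "Forward Only Surviving: 17315 (0.81%) "
        "Reverse Only Surviving: 6413 (0.30%) "
        "Dropped: 308617 (14.51%)")
_findall = re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall

def extract_re(line=line):
    return {k: int(v) for k, v in _findall(line)}

# Illustrative settings only: 3 runs of 10000 calls each.
NUMBER = 10000
timings = timeit.repeat(extract_re, repeat=3, number=NUMBER)

# Like the command line tool, report the best (smallest) run per call.
best_usec = min(timings) / NUMBER * 1e6
print("best of 3: %.1f usec per loop" % best_usec)
```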

Steven's original extract():

$ python3 -m timeit -s "from parse_jarod import extract_daprano as extract" "extract()"
10000 loops, best of 3: 92.6 usec per loop

Avoid raising ValueError by testing with str.isdigit() first (this won't work
with negative numbers):

$ python3 -m timeit -s "from parse_jarod import extract_daprano2 as extract" "extract()"
10000 loops, best of 3: 44.3 usec per loop
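
To make the caveat about negative numbers concrete, here is a small sketch of the two conversion styles side by side. The helper names are hypothetical, and to_int_signed() is just one possible workaround that I am assuming would be acceptable; it is not part of parse_jarod.py.

```python
def to_int_try(word):
    # Style of the original extract(): let int() decide, catch the error.
    try:
        return int(word)
    except ValueError:
        return word

def to_int_isdigit(word):
    # Style of extract_daprano2(): test first, so no exception is raised.
    return int(word) if word.isdigit() else word

def to_int_signed(word):
    # Hypothetical variant: tolerate one leading sign before the digits.
    digits = word[1:] if word.startswith(("+", "-")) else word
    return int(word) if digits.isdigit() else word

for word in ["2127436", "-5", "Pairs:"]:
    print(repr(word), to_int_try(word), to_int_isdigit(word),
          to_int_signed(word))
```

The isdigit() version leaves "-5" as a string because the minus sign is not a digit, while the other two convert it to an int.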

Collapse the two loops into one, thus avoiding the accumulator list and the 
isinstance() checks:

$ python3 -m timeit -s "from parse_jarod import extract_daprano3 as extract" "extract()"
10000 loops, best of 3: 29.6 usec per loop

Ok, this is still slower than the regex, a result that I cannot accept. 
Let's try again:

$ python3 -m timeit -s "from parse_jarod import extract_py as extract" "extract()"
100000 loops, best of 3: 15.1 usec per loop

Eureka? The "winning" code is brittle and probably as hard to understand as
the regex. You can judge for yourself if you're interested:

$ cat parse_jarod.py
import re

line = ("Input Read Pairs: 2127436 "
        "Both Surviving: 1795091 (84.38%) "
        "Forward Only Surviving: 17315 (0.81%) "
        "Reverse Only Surviving: 6413 (0.30%) "
        "Dropped: 308617 (14.51%)")
_findall = re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall

def extract_daprano(line=line):
    # Extract key:number values from the string.
    line = line.strip()  # Remove leading and trailing whitespace.
    words = line.split()
    accumulator = []  # Collect parts of the string we care about.
    for word in words:
        if word.startswith('(') and word.endswith('%)'):
            # We don't care about percentages in brackets.
            continue
        try:
            n = int(word)
        except ValueError:
            accumulator.append(word)
        else:
            accumulator.append(n)
    # Now accumulator will be a list of strings and ints:
    # e.g. ['Input', 'Read', 'Pairs:', 1234, 'Both', 'Surviving:', 1000]
    # Collect consecutive strings as the key, int to be the value.
    results = {}
    keyparts = []
    for item in accumulator:
        if isinstance(item, int):
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = item
        else:
            keyparts.append(item)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line "%s".' % extra)
    return results

def extract_daprano2(line=line):
    words = line.split()
    accumulator = []
    for word in words:
        if word.startswith('(') and word.endswith('%)'):
            continue
        if word.isdigit():
            word = int(word)
        accumulator.append(word)

    results = {}
    keyparts = []
    for item in accumulator:
        if isinstance(item, int):
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = item
        else:
            keyparts.append(item)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line "%s".' % extra)
    return results

def extract_daprano3(line=line):
    results = {}
    keyparts = []
    for word in line.split():
        if word.startswith("("):
            continue
        if word.isdigit():
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = int(word)
        else:
            keyparts.append(word)

    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line "%s".' % extra)
    return results

def extract_re(line=line):
    return {k: int(v) for k, v in _findall(line)}

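def extract_py(line=line):
    # Split on colons: the first part is the first key; every later part
    # is "<value> [(<percent>)] [<next key>]".
    key = None
    result = {}
    for part in line.split(":"):
        if key is None:
            key = part
        else:
            # Brittle: needs some text after every value, the last included.
            value, new_key = part.split(None, 1)
            result[key] = int(value)
            # Drop everything up to the closing bracket of the percentage.
            key = new_key.rpartition(")")[-1].strip()
    return result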

if __name__ == "__main__":
    assert (extract_daprano() == extract_re() == extract_daprano2()
            == extract_daprano3() == extract_py())
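
As an illustration of the brittleness mentioned above, the following sketch feeds extract_py() two made-up lines. The function is copied from the script without the default argument; the inputs "A: 1 (50%) B: 2 (50%)" and "A: 1 (50%) B: 2" are invented for the example.

```python
def extract_py(line):
    # Same logic as extract_py() in parse_jarod.py.
    key = None
    result = {}
    for part in line.split(":"):
        if key is None:
            key = part
        else:
            value, new_key = part.split(None, 1)
            result[key] = int(value)
            key = new_key.rpartition(")")[-1].strip()
    return result

# Works while every value is followed by more text.
ok = extract_py("A: 1 (50%) B: 2 (50%)")
print(ok)

# Fails when the last value has nothing after it: split(None, 1)
# then returns a single item and the tuple unpacking raises ValueError.
error = None
try:
    extract_py("A: 1 (50%) B: 2")
except ValueError as exc:
    error = exc
print(type(error).__name__)
```

The sample line in the script happens to end with a percentage, which is why the problem does not show up there.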