split and regexp on textfile

Fri Apr 13 05:59:28 EDT 2007

Flyzone:
> i have a problem with the split function and regexp.
> I have a file that i want to split using the date as token.

My first try:

data = """
error text
Mon Apr  9 22:30:18 2007
text
text
Mon Apr  9 22:31:10 2007
text
text
Mon Apr 10 22:31:10 2007
text
text
"""

import re
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")

section = []
for line in data.splitlines():
    if date_find.search(line):
        if section:
            print "\n" + "-" * 10 + "\n", "\n".join(section)
        section = [line]
    else:
        if line:
            section.append(line)

print "\n" + "-" * 10 + "\n", "\n".join(section)

itertools.groupby() is suited to splitting sequences like:
1111100011111100011100101011111
as:
11111 000 111111 000 111 00 1 0 1 0 11111
While here we have a sequence like:
100001000101100001000000010000
that has to be split as:
10000 1000 10 1 10000 10000000 10000
A standard itertools function could be added for this fairly common
situation too.
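
To make the difference concrete, here is a small sketch that runs groupby on the first sequence and a split-at-marker loop on the second. The helper name split_before is made up for this illustration (it is not an itertools function), and the code is not tested much:

```python
from itertools import groupby

def split_before(seq, is_start):
    # Hypothetical helper: open a new group at every item
    # for which is_start(item) is true.
    group = []
    for item in seq:
        if is_start(item) and group:
            yield group
            group = []
        group.append(item)
    if group:
        yield group

# groupby cuts wherever the value changes
runs = ["".join(g) for _, g in groupby("1111100011111100011100101011111")]
# runs == ['11111', '000', '111111', '000', '111', '00', '1', '0', '1', '0', '11111']

# split_before cuts in front of every "1", whatever follows it
parts = ["".join(g) for g in
         split_before("100001000101100001000000010000", lambda c: c == "1")]
# parts == ['10000', '1000', '10', '1', '10000', '10000000', '10000']
```

With groupby alone, two consecutive markers would fall into the same run, which is why the date lines need a different kind of splitting.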

Along those lines I have devised this different (and maybe over-
engineered) version:

from itertools import groupby
import re

class Splitter(object):
    # Not tested much
    # Key function for groupby: the returned value flips every time
    # the predicate matches, so each matching element opens a new group.
    def __init__(self, predicate):
        self.predicate = predicate
        self.state = True
    def __call__(self, el):
        if self.predicate(el):
            self.state = not self.state
        return self.state

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
splitter = Splitter(date_find.search)

sections = ("\n".join(g)
            for h, g in groupby(data.splitlines(), key=splitter))
for section in sections:
    if section:
        print "\n" + "-" * 10 + "\n", section

The Splitter class + the groupby can become a single simpler
generator, like in this version:

def grouper(seq, key=bool):
    # A fast identity function can be used instead of bool()
    # Not tested much
    group = []
    for part in seq:
        if key(part):
            if group: yield group
            group = [part]
        else:
            group.append(part)
    if group: yield group  # no empty group for an empty input

import re
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")

for section in grouper(data.splitlines(), date_find.search):
    print "\n" + "-" * 10 + "\n", "\n".join(section)

Maybe that grouper can be modified to produce each group lazily, like
groupby does, instead of building a real list.
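
One possible way to do that is sketched below. It is only a rough idea, not tested much: the name lazy_grouper and the internal state dict are inventions of this sketch, and it relies on the two-argument form of the next() builtin, so it needs a Python version that has it. As with groupby, each yielded group shares the underlying iterator, so a group is no longer usable once you move on to the next one.

```python
import re

def lazy_grouper(seq, key=bool):
    # Hypothetical lazy variant of grouper: yields iterators, not lists.
    it = iter(seq)
    done = object()  # unique marker meaning "no pending item"
    # "pending" holds the item that starts the next group
    state = {"pending": next(it, done)}

    def one_group():
        first = state["pending"]
        state["pending"] = done
        yield first
        for part in it:
            if key(part):
                # this item belongs to the next group
                state["pending"] = part
                return
            yield part

    while state["pending"] is not done:
        group = one_group()
        yield group
        for _ in group:
            pass  # skip whatever the caller left unread

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
lines = [
    "error text",
    "Mon Apr  9 22:30:18 2007",
    "text a",
    "Mon Apr  9 22:31:10 2007",
    "text b",
]

groups = [list(g) for g in lazy_grouper(lines, date_find.search)]
# groups == [['error text'],
#            ['Mon Apr  9 22:30:18 2007', 'text a'],
#            ['Mon Apr  9 22:31:10 2007', 'text b']]

# Only the first line of each section is read; the rest is skipped
firsts = [next(g) for g in lazy_grouper(lines, date_find.search)]
# firsts == ['error text', 'Mon Apr  9 22:30:18 2007', 'Mon Apr  9 22:31:10 2007']
```

Unlike the list version, this sketch yields nothing at all for an empty input.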

Flyzone (seen later):
>Amm..not! I need to get the text-block between the two data, not the data! :)

Then you can modify the code like this:

def grouper(seq, key=bool):
    group = []
    for part in seq:
        if key(part):
            if group: yield group
            group = [] # changed
        else:
            group.append(part)
    if group: yield group  # skip an empty trailing block too
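
If both behaviours are wanted, the two versions can be folded into one. In this sketch the keep_marker parameter is made up for the illustration, the sample lines are a shortened stand-in for the data above, and the last yield is guarded so that an input ending with a date line does not produce an empty trailing block:

```python
import re

def grouper(seq, key=bool, keep_marker=True):
    # keep_marker is a hypothetical flag: True keeps the matching line
    # at the head of its group, False drops it.
    group = []
    for part in seq:
        if key(part):
            if group:
                yield group
            group = [part] if keep_marker else []
        else:
            group.append(part)
    if group:
        yield group

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
lines = [
    "error text",
    "Mon Apr  9 22:30:18 2007",
    "text a",
    "Mon Apr  9 22:31:10 2007",
    "text b",
    "text c",
]

with_dates = list(grouper(lines, date_find.search))
# with_dates == [['error text'],
#                ['Mon Apr  9 22:30:18 2007', 'text a'],
#                ['Mon Apr  9 22:31:10 2007', 'text b', 'text c']]

without_dates = list(grouper(lines, date_find.search, keep_marker=False))
# without_dates == [['error text'], ['text a'], ['text b', 'text c']]
```

Note that the text before the first date ('error text' here) still comes out as its own block in both cases; drop the first group if you only want what lies between dates.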

Bye,
bearophile