Candidate for a new itertool

Thu Mar 12 13:36:30 EDT 2009

On Mar 7, 8:47 pm, Raymond Hettinger <pyt... at rcn.com> wrote:
> The existing groupby() itertool works great when every element in a
> group has the same key, but it is not so handy when groups are
> determined by boundary conditions.
>
> For edge-triggered events, we need to convert a boundary-event
> predicate to groupby-style key function.  The code below encapsulates
> that process in a new itertool called split_on().
>
> Would love you guys to experiment with it for a bit and confirm that
> you find it useful.  Suggestions are welcome.
>
> Raymond
>
> -----------------------------------------
>
> from itertools import groupby
>
> def split_on(iterable, event, start=True):
>     'Split iterable on event boundaries (either start events or stop events).'
>     # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
>     # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
>     def transition_counter(x, start=start, cnt=[0]):
>         before = cnt[0]
>         if event(x):
>             cnt[0] += 1
>         after = cnt[0]
>         return after if start else before
>     return (g for k, g in groupby(iterable, transition_counter))
>
> if __name__ == '__main__':
>     for start in True, False:
>         for g in split_on('X1X23X456X', 'X'.__eq__, start):
>             print list(g)
>         print
>
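>     from itertools import islice
>     from pprint import pprint
>     boundary = '--===============2615450625767277916==\n'
>     email = open('email.txt')
>     for mime_section in split_on(email, boundary.__eq__):
>         # list() takes a single iterable; islice skips the boundary line
>         pprint(list(islice(mime_section, 1, None)))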
>         print '= = ' * 30

For me, your examples don't justify such a general algorithm. A split
function that works on any iterable instead of just strings seems
straightforward, so maybe we should have that, plus a separate
function motivated by examples of problems where a plain split does
not work.
Something like this should work for the two examples you gave, where the
boundaries are known constants (so there is really no need to keep
them; I can always add them back later):

def split_on(iterable, boundary):
    l=[]
    for el in iterable:
        if el!=boundary:
            l.append(el)
        else:
            yield l
            l=[]
    yield l

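def join_on(iterable, boundary):
    it=iter(iterable)
    firstel=it.next()
    # emit the first group as-is, then each later group preceded by the boundary
    for x in firstel:
        yield x
    for el in it:
        yield boundary
        for x in el:
            yield x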

if __name__ == '__main__':
    lst=[]
    for g in split_on('X1X23X456X', 'X'):
        print list(g)
        lst.append(g)
    print
    print list(join_on(lst,'X'))
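
For anyone who wants to try the round trip on Python 3, here is a rough sketch of the same two functions. It is only an illustration of the constant-boundary idea above, not a proposed itertool; the names chunk and chunks are made up for the example, and it assumes the boundary can be compared to the items with !=.

```python
def split_on(iterable, boundary):
    """Yield lists of the items found between occurrences of boundary."""
    chunk = []
    for item in iterable:
        if item != boundary:
            chunk.append(item)
        else:
            yield chunk
            chunk = []
    yield chunk  # trailing chunk, possibly empty


def join_on(chunks, boundary):
    """Inverse of split_on: put boundary back between consecutive chunks."""
    first = True
    for chunk in chunks:
        if not first:
            yield boundary
        first = False
        for item in chunk:
            yield item


pieces = list(split_on('X1X23X456X', 'X'))
print(pieces)
print(''.join(join_on(pieces, 'X')))
```

Leading and trailing boundaries show up as empty lists at the ends of pieces, which is what lets join_on rebuild the original sequence exactly.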