itertools.groupby

Sun May 27 23:28:26 EDT 2007

Raymond Hettinger <python at rcn.com> writes:
> The groupby itertool came-out in Py2.4 and has had remarkable
> success (people seem to get what it does and like using it, and
> there have been no bug reports or reports of usability problems).
> All in all, that ain't bad (for what 7stud calls a poster child).

I use the module all the time now and it is great.  Basically it
gets rid of the problem of the "lump moving through the snake"
when iterating through a sequence, noticing when some condition
changes, and having to juggle an element from one call to another.
That said, I too found the docs a little confusing on first reading.
I'll see if I can go over them again and suggest improvements.
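
To make the "lump" concrete, here is a rough sketch of the hand-rolled loop that groupby replaces. The name manual_runs and the sample word list are made up for illustration; this is not code from the itertools docs.

```python
from itertools import groupby

def manual_runs(seq, key):
    """Group consecutive items with equal key(item), without groupby."""
    run = []
    for item in seq:
        # The item that reveals a key change already belongs to the next
        # run, so the finished run must be emitted before storing it.
        if run and key(item) != key(run[0]):
            yield run
            run = []
        run.append(item)
    if run:
        # The last run is never followed by a key change; emit it here.
        yield run

words = ["apple", "avocado", "banana", "blueberry", "cherry"]  # sample data

def first_letter(word):
    return word[0]

by_hand = list(manual_runs(words, first_letter))
with_groupby = [list(group) for _, group in groupby(words, first_letter)]
```

Both lists come out the same; the difference is that the groupby version has no carried-over state or end-of-input special case to get wrong.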

Here for me is a typical example: you have a file of multi-line
records.  Each record starts with a marker saying "New Record".  You
want to iterate through the records.  You could do it by collecting
lines until you see a new record marker, then juggling the marker into
the next record somehow.  In some situations you could do it with some
kind of pushback mechanism, which the general iterator protocol doesn't
support (maybe it should).  I like to do it with what I
call a "Bates stamp".  (A Bates stamp is a rubber stamp with a serial
numbering mechanism, so each time you operate it the number goes
higher by one.  You use it to stamp serial numbers on pages of legal
documents and that sort of thing.)  I use a small generator to supply
Bates numbers to the lines from the file, incrementing the number every
time there's a new record:

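   import operator
   from itertools import groupby, imap   # Python 2; on Python 3 use the built-in map

   fst = operator.itemgetter(0)
   snd = operator.itemgetter(1)

   def bates(fd):
       # generate tuples (n,d) of lines from file fd,
       # where n is the record number.  Just iterate through all lines
       # of the file, stamping a number on each one as it goes by.
       n = 0   # record number
       for d in fd:
           if d.startswith('New Record'):
               n += 1
           yield (n, d)

   def records(fd):
       # each group is the run of lines sharing one Bates number,
       # i.e. one record; strip the numbers off before yielding it
       for i, d in groupby(bates(fd), fst):
           yield imap(snd, d)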

This shows a "straight paper path" approach where all the buffering
and juggling is hidden inside groupby.
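
For anyone trying this on Python 3, where imap is gone and the built-in map is already lazy, here is a self-contained sketch of the same idea. The in-memory sample file and its contents are invented for the example; any iterable of lines should work the same way.

```python
import io
import operator
from itertools import groupby

fst = operator.itemgetter(0)
snd = operator.itemgetter(1)

def bates(fd):
    # Stamp each line with the number of the record it belongs to.
    n = 0
    for d in fd:
        if d.startswith('New Record'):
            n += 1
        yield (n, d)

def records(fd):
    # One group per Bates number; drop the number and keep the lines.
    for _, group in groupby(bates(fd), fst):
        yield map(snd, group)

# Hypothetical sample input standing in for a real file.
sample = io.StringIO(
    "New Record\n"
    "alpha\n"
    "beta\n"
    "New Record\n"
    "gamma\n"
)

# Each record is a lazy iterator that shares the underlying file with
# groupby, so consume it before moving on to the next record.
result = [list(rec) for rec in records(sample)]
```

Two things to keep in mind with this sketch: any lines before the first marker get stamp 0 and come out as their own leading record, and a record that is not consumed before the next one is requested will be skipped over by groupby.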