"groupby" is brilliant!
James Stroud
jstroud at ucla.edu
Tue Jun 13 16:27:58 EDT 2006
James Stroud wrote:
> Frank Millman wrote:
>
>> Hi all
>>
>> This is probably old hat to most of you, but for me it was a
>> revelation, so I thought I would share it in case someone has a similar
>> requirement.
>>
>> I had to convert an old program that does a traditional pass through a
>> sorted data file, breaking on a change of certain fields, processing
>> each row, accumulating various totals, and doing additional processing
>> at each break. I am not using a database for this one, as the file
>> sizes are not large - a few thousand rows at most. I am using csv
>> files, and using the csv module so that each row is nicely formatted
>> into a list.
>>
>> The traditional approach is quite fiddly, saving the values of the
>> various break fields, comparing the values on each row with the saved
>> values, and taking action if the values differ. The more break fields
>> there are, the fiddlier it gets.
>>
>> I was going to do the same in python, but then I vaguely remembered
>> reading about 'groupby'. It took a little while to figure it out, but
>> once I had cracked it, it transformed the task into one of utter
>> simplicity.
>>
>> Here is an example. Imagine a transaction file sorted by branch,
>> account number, and date, and you want to break on all three.
>>
>> -----------------------------
>> import csv
>> from itertools import groupby
>> from operator import itemgetter
>>
>> BRN = 0
>> ACC = 1
>> DATE = 2
>>
>> reader = csv.reader(open('trans.csv', 'rb'))
>> rows = []
>> for row in reader:
>>     rows.append(row)
>>
>> for brn,brnList in groupby(rows,itemgetter(BRN)):
>>     for acc,accList in groupby(brnList,itemgetter(ACC)):
>>         for date,dateList in groupby(accList,itemgetter(DATE)):
>>             for row in dateList:
>>                 [do something with row]
>>             [do something on change of date]
>>         [do something on change of acc]
>>     [do something on change of brn]
>> -----------------------------
>>
>> Hope someone finds this of interest.
>>
>> Frank Millman
>>
>
> I'm sure I'm going to get a lot of flak on this list for proposing to
> turn nested for-loops into a recursive function, but I couldn't help
> myself. This seems simpler to me, but for others it may be difficult
> to look at, and these people will undoubtedly complain.
>
>
> import csv
> from itertools import groupby
> from operator import itemgetter
>
> reader = csv.reader(open('trans.csv', 'rb'))
> rows = []
> for row in reader:
>     rows.append(row)
>
> def brn_doer(row):
>     [doing something with brn here]
>
> def acc_doer(date):
>     [you get the idea]
>
> [etc.]
>
> doers = [brn_doer, acc_doer, date_doer, row_doer]
>
> def doit(rows, doers, i=0):
>     for r, alist in groupby(rows, itemgetter(i)):
>         doit(alist, doers[1:], i+1)
>         doers[0](r)
>
> doit(rows, doers, 0)
>
> Now all of those ugly for loops become one recursive function. Bear in
> mind, it's not all that 'elegant', but it looks nicer, is more succinct,
> abstracts the process, and scales to arbitrary depth. Tragically,
> however, it has been generalized, which is likely to raise some hackles
> here. And, oh yes, it didn't answer exactly your question (which you
> didn't really have). I'm sure I will regret this because, as you will
> find, suggesting code on this list with additional utility is somewhat
> discouraged by the vociferous few who make a religion out of 'import this'.
>
> Also, I still have no idea what 'groupby' does. It looks interesting
> though; thanks for pointing it out.
>
> James
>
Forgot to test for the stopping condition:
def doit(rows, doers, i=0):
    for r, alist in groupby(rows, itemgetter(i)):
        if len(doers) > 1:
            doit(alist, doers[1:], i+1)
        doers[0](r)
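
For anyone who wants to try the idea without a trans.csv on hand, here is a minimal runnable sketch. It is a variant of the function above, not the same code: it assumes the last entry in doers is meant to receive each row, so the recursion stops there and loops over the rows instead of grouping on a fourth column. The sample rows, the log list, and the make_doer helper are all made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data standing in for trans.csv, already sorted by
# branch, account and date (groupby only merges adjacent equal keys).
rows = [
    ['B1', 'A1', '2006-06-01', '10'],
    ['B1', 'A1', '2006-06-01', '5'],
    ['B1', 'A2', '2006-06-02', '7'],
    ['B2', 'A1', '2006-06-01', '3'],
]

log = []  # records the order in which the handlers fire

def make_doer(label):
    # Hypothetical helper: builds a break handler that logs the finished key.
    def doer(key):
        log.append((label, key))
    return doer

def row_doer(row):
    # Per-row handler: logs the amount column (index 3 in the sample data).
    log.append(('row', row[3]))

doers = [make_doer('brn'), make_doer('acc'), make_doer('date'), row_doer]

def doit(rows, doers, i=0):
    # Only the row handler is left: call it on every row and stop recursing.
    if len(doers) == 1:
        for row in rows:
            doers[0](row)
        return
    # Otherwise group on column i, handle the nested levels first, then
    # report that the group for this key has ended.
    for key, group in groupby(rows, itemgetter(i)):
        doit(group, doers[1:], i + 1)
        doers[0](key)

doit(rows, doers)

for event in log:
    print(event)
```

Each break handler fires after the rows in its group have been processed, which matches where the "[do something on change of ...]" lines sit in the nested-loop version. Note that the inner groups have to be consumed before the outer loop advances, which the recursion does naturally.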
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
http://www.jamesstroud.com/