[Numpy-discussion] Fwd: Advice for grouping recarrays

Wed Apr 28 14:44:13 EDT 2010

Someone inquired about this one today and I wanted to clarify there is
now a better way to do this that I didn't know about when I posted the
original:

>>> ind = numpy.array([0,0,0,0,1,1,1,2,2,2,])
>>> data = numpy.arange(10)
>>> borders = numpy.arange(len(ind)).compress(numpy.hstack([[1], ind[1:]!=ind[:-1]]))
>>> numpy.add.reduceat(data, borders)
array([ 6, 15, 24])


On Tue, Jul 18, 2006 at 8:49 AM, Tom Denniston
<tom.denniston at alum.dartmouth.org> wrote:
> I suggest
>
> lexsort
> itertools.groupby of the indices
> take
>
> I think it would be really great if numpy had the first two as a
> function or something like that.  It is really useful to be able to
> take an array and bucket it and apply further numpy operations like
> accumulation functions.
>
> On 7/18/06, Stephen Simmons <mail at stevesimmons.com> wrote:
>>
>> Hi,
>>
>> Does anyone have any suggestions for summarising data in numpy?
>>
>> The quick description is that I want to do something like the SQL
>> statement:
>>   SELECT sum(field1), sum(field2) FROM  table GROUP BY field3;
>>
>> The more accurate description is that my data is stored in PyTables HDF
>> format, with 24 monthly files, each with 4m records describing how
>> customers performed that month. Each record looks something like this:
>> ('200604', 651404500000L, '800', 'K', 12L, 162.0, 2000.0, 0.054581, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 8.80, 0.86, 7.80 17.46, 0.0, 70.0, 0.0, 70.0, -142.93, 0.0, 2000.0,
>> 2063.93, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -9.71, 7.75,
>> 87.46, 77.75, -3.45, 0.22, -0.45, -0.57, 73.95)
>> The first 5 fields are status fields (month_string, account_number,
>> product_code, account_status, months_since_customer_joined). The
>> remaining 48 fields represent different aspects of the customer's
>> performance during that month. I read 100,000 of these records at a time
>> and turn them into a numpy recarray with:
>>   dat = hdf_table.read(start=pos, stop=pos+block_size)
>>   dat = numpy.asarray(dat._flatArray, dtype=dat.array_descr)
>>
>> I'd like to reduce these 96m records x 53 fields down to monthly
>> averages for each tuple (month_string, months_since_customer_joined)
>> which in the case above is ('200604', 12L). This will let me compare the
>> performance of newly acquired customers at the same point in their
>> lifecycle as customers acquired 1 or 2 years ago.
>>
>> The end result should be a dataset something like
>>   res[month_index, months_since_customer_joined]
>>   = array([ num_records, sum_field_5, sum_field_6, sum_field_7, ...
>> sum_field_52 ])
>> with a shape of (24, 24, 49).
>>
>> I've played around with lexsort(), take(), sum(), etc, but get very
>> confused and end up feeling that I'm making things more complicated than
>> they need to be. So any advice from numpy veterans on how best to
>> proceed would be very welcome!
>>
>> Cheers
>>
>> Stephen
>>
>> -------------------------------------------------------------------------
>> Take Surveys. Earn Cash. Influence the Future of IT
>> Join SourceForge.net's Techsay panel and you'll get the chance to share
>> your
>> opinions on IT & business topics through brief surveys -- and earn cash
>> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
>> _______________________________________________
>> Numpy-discussion mailing list
>> Numpy-discussion at lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/numpy-discussion
>>
>