[Numpy-discussion] Advice for grouping recarrays

Tue Jul 18 09:49:51 EDT 2006

I suggest

lexsort
itertools.groupby of the indices
take

I think it would be really great if numpy had the first two as a
function or something like that.  It is really useful to be able to
take an array and bucket it and apply further numpy operations like
accumulation functions.

On 7/18/06, Stephen Simmons <mail at stevesimmons.com> wrote:
> Hi,
>
> Does anyone have any suggestions for summarising data in numpy?
>
> The quick description is that I want to do something like the SQL statement:
>    SELECT sum(field1), sum(field2) FROM  table GROUP BY field3;
>
> The more accurate description is that my data is stored in PyTables HDF
> format, with 24 monthly files, each with 4m records describing how
> customers performed that month. Each record looks something like this:
> ('200604', 651404500000L, '800', 'K', 12L, 162.0, 2000.0, 0.054581, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 8.80, 0.86, 7.80 17.46, 0.0, 70.0, 0.0, 70.0, -142.93, 0.0, 2000.0,
> 2063.93, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -9.71, 7.75,
> 87.46, 77.75, -3.45, 0.22, -0.45, -0.57, 73.95)
> The first 5 fields are status fields (month_string, account_number,
> product_code, account_status, months_since_customer_joined). The
> remaining 48 fields represent different aspects of the customer's
> performance during that month. I read 100,000 of these records at a time
> and turn them into a numpy recarray with:
>    dat = hdf_table.read(start=pos, stop=pos+block_size)
>    dat = numpy.asarray(dat._flatArray, dtype=dat.array_descr)
>
> I'd like to reduce these 96m records x 53 fields down to monthly
> averages for each tuple (month_string, months_since_customer_joined)
> which in the case above is ('200604', 12L). This will let me compare the
> performance of newly acquired customers at the same point in their
> lifecycle as customers acquired 1 or 2 years ago.
>
> The end result should be a dataset something like
>    res[month_index, months_since_customer_joined]
>    = array([ num_records, sum_field_5, sum_field_6, sum_field_7, ...
> sum_field_52 ])
> with a shape of (24, 24, 49).
>
> I've played around with lexsort(), take(), sum(), etc, but get very
> confused and end up feeling that I'm making things more complicated than
> they need to be. So any advice from numpy veterans on how best to
> proceed would be very welcome!
>
> Cheers
>
> Stephen
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys -- and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/numpy-discussion
>