[AstroPy] astropy.table.Table groups: aggregate over & combine multiple columns?

Wed Jan 20 12:40:27 EST 2016

Excellent question indeed.

The first quick comment I have is to always be aware that directly using
the functions np.sum and np.mean in aggregation will be orders of magnitude
faster than calling the `average()` function that was defined in the
original post.  That is because in those special cases the numpy `reduceat`
method is called and everything gets done in the C layer.  Thus Andrew's
suggestion for a workaround in this case is the right way to go if the data
tables are large.

About the more generalized problem of getting access to the other table
columns within the aggregation function, that is unfortunately not possible
in the current release code.  I have an idea for doing this, which is now
an astropy issue (https://github.com/astropy/astropy/issues/4513).

As for what to do right now with astropy 1.1, the following illustrates how
to do generalized aggregation in the way that is needed for this example.
It will be relatively slow and possibly memory intensive, but if the tables
are not huge that won't be a problem:

from __future__ import division, print_function
from astropy import table
from astropy.table import Table
from collections import OrderedDict

t = Table([['a', 'a', 'a', 'b', 'b', 'c'],
           [1, 2, 3, 4, 5, 6],
           [2, 2, 1, 2, 1, 1]],
          names=('name', 'value', 'weight'))

grouped = t.group_by('name')

def transform_table(tbl):
    """
    Generalized function that takes table ``tbl`` as input
    and returns a new Table ``out``.  Note that ``out`` does not
    necessarily need to have the same types or columns.

    The example is just the identity transform.  Be aware that
    in-place operations will affect the input table.
    """
    out = tbl
    return out

out_tables = []
for group in grouped.groups:
    out_tables.append(transform_table(group))
result = table.vstack(out_tables)
print('transform_table')
print(result)
print()

def average_weighted(tbl, name):
    col = tbl[name]
    if name == 'weight':
        value = col.sum()
    else:
        weight = tbl['weight']
        value = (col * weight).sum() / weight.sum()

    return value

def transform_table_to_row(tbl, func):
    """
    Generalized function that takes table ``tbl`` as input
    and returns a new table row as an OrderedDict.  It applies
    function ``func`` to each column.

    The example computes the weighted average of each field (where
    possible) assuming the weights are in column ``weight``.
    """
    out = OrderedDict()
    for name in t.colnames:
        try:
            value = func(tbl, name)
        except:
            # If something went wrong just ignore (could not perform
            # operation on this column).
            pass
        else:
            out[name] = value
    return out

out_rows = []
for group in grouped.groups:
    out_rows.append(transform_table_to_row(group, average_weighted))
result = Table(rows=out_rows)

print('transform_table_to_row')
print(result)

Code also at https://gist.github.com/taldcroft/12249ad7eeacbec12f44.

Cheers,
Tom

On Wed, Jan 20, 2016 at 7:47 AM, Andrew Hearin <andrew.hearin at yale.edu>
wrote:

> Hi Evert,
>
> Great question, I'm also really interested to hear the answer to this. I
> always use the built-in table aggregation functions when possible, but
> sometimes end up writing my own Numpy calculation for more complicated
> examples (using np.unique and/or np.searchsorted).
>
> For computing a group-wise weighted average, there is a way you can recast
> your problem that allows you to use the existing astropy built-in: just
> create a new column that is the product of your second and third columns, '
> and then use aggregate(average)  in the normal way on this new column.
>
> So I *think* that gives an answer to the specific example you gave, but it
> dodges the real question, which I am also interested to hear the experts
> weigh in on.
>
> Andrew
>
> On Tue, Jan 19, 2016 at 11:52 PM, Evert Rol <evert.rol at gmail.com> wrote:
>
>> Is there a way in an astropy table to run the TableGroups aggregate
>> function on multiple columns at once?
>>
>> In this specific case, I'd like to group by names in one column, and then
>> average the second column weighted by values in the third column.
>> An example would be:
>>
>>   from astropy.table import Table
>>
>>   def average(col):
>>       # Manipulate multiple columns at once?
>>       return col.mean()
>>
>>   t = Table([['a', 'a', 'a', 'b', 'b', 'c'],
>>              [1, 2, 3, 4, 5, 6],
>>              [2, 2, 1, 2, 1, 1]],
>>             names=('name', 'value', 'weight'))
>>   group = t.group_by('name')
>>   result = group.groups.aggregate(average)
>>   print(result)
>>
>> which gives
>>
>>   name value     weight
>>   ---- ----- -------------
>>      a   2.0 1.66666666667
>>      b   4.5           1.5
>>      c   6.0           1.0
>>
>> which is not what I want.
>>
>>
>> In Pandas, this can be done with apply() on a groupby object, since that
>> passes the relevant subsection the dataframe as input to the function.
>> So I can write:
>>
>>   def average_pd(df):
>>       weight = df['weight']
>>       total = weight.sum()
>>       df['value'] *= weight / total
>>       df['value'] = df['value'].sum()
>>       df['weight'] = total  # for info; not necessary
>>       return df.iloc[0]  # ignore other rows: they are the same anyway
>>
>>   df = t.to_pandas()
>>   result = df.groupby('name')[['value', 'weight']].apply(average_pd)
>>   print(result)
>>
>> which gives:
>>
>>          value  weight
>> name
>> a     1.800000       5
>> b     4.333333       3
>> c     6.000000       1
>>
>> and 'value' consists of weighted averages.
>>
>> (code also on
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.github.com_evertrol_12955a5d98edf055a2f4&d=AwICAg&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=-BKjmG3hTRkdOfmFIKI5e3myB8cKiFHeJbTAhi3Zg5U&s=ix-QzeHis8ltaMFyVo3QvHpnQYri_s75MpTGsufcbqM&e=
>> )
>>
>>
>> Perhaps I overlooked some documentation, but I can't find if this can be
>> done in astropy.table. Or do I just need to approach this differently?
>> Alternatively, should I convert & stick to Pandas for this type of
>> functionality?
>>
>>
>>   Evert
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> AstroPy mailing list
>> AstroPy at scipy.org
>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.scipy.org_mailman_listinfo_astropy&d=AwICAg&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=-BKjmG3hTRkdOfmFIKI5e3myB8cKiFHeJbTAhi3Zg5U&s=XZ616g8wR7LBzFglTQ8J2F-bDe6rE-HuXIrePKntv6w&e=
>>
>
>
> _______________________________________________
> AstroPy mailing list
> AstroPy at scipy.org
> https://mail.scipy.org/mailman/listinfo/astropy
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/astropy/attachments/20160120/867b2547/attachment.html>