[Pandas-dev] Create an aggregate/summary pandas dataframe based on overlapping dates derived from a more specific dataframe?

Aaron aaron.christensen at gmail.com
Mon Jun 1 10:02:06 EDT 2020


Hello,

Given a dateframe with trips made my employees of different companies, I am
trying to generate a new dataframe with only the company names.  There are
trips made by employees from different companies.  I am looking to combine
the overlapping travel times from employees of the SAME company into a
single row.  If there are no overlapping travel times, then that row just
transfers over as-is.  When there are overlapping travel times, then the
following will happen:

--The name field is removed b/c that is no longer relevant (company name
stays), the Depart date will be the earliest date of any of the trip dates
regardless of the employee, the Return date will be the latest date of any
of the trip dates regardless of the employee, the charges for the trip will
be summed

For example, if trips had dates 01/01/20 - 01/31/20, 01/15/20 - 02/15/20,
02/01-20 - 02/28/20, then all three would be combined.  The starting date
will be 1/1/20 and ending as of 2/28/20  Basically, the company was on that
trip from start to finish… kinda like a relay run handing off the baton.
Also, the charges will be summed for each of those trips and transferred
over to the single row.

I was playing around with timedelta, hierarchal indices, grouping, and
sorting but had a really hard time since I am looking at date ranges
instead of specific dates.

Here is the starting dataframe code/output:

import pandas as pd


emp_trips = {'Name': ['Bob','Joe','Sue','Jack', 'Henry', 'Frank',
'Lee', 'Jack'],
        'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
        'Depart' : ['01/01/2020', '01/01/2020', '01/06/2020',
'01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
        'Return' : ['01/31/2020', '02/15/2020', '02/20/2020',
'03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
        'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]
        }

df = pd.DataFrame(emp_trips, columns = ['Name', 'Company', 'Depart',
'Return', 'Charges'])
# Convert to date format
df['Return']= pd.to_datetime(df['Return'])
df['Depart']= pd.to_datetime(df['Depart'])

  Name Company     Depart     Return  Charges0    Bob     ABC
2020-01-01 2020-01-31    10.101    Joe     ABC 2020-01-01 2020-02-15
 20.252    Sue     ABC 2020-01-06 2020-02-20    30.323   Jack     HIJ
2020-01-01 2020-03-01    40.004  Henry     HIJ 2020-05-01 2020-05-05
 50.015  Frank     DEF 2020-01-13 2020-01-15    60.326    Lee     DEF
2020-01-12 2020-01-30    70.997   Jack     DEF 2020-01-14 2020-02-02
 80.87

And, here is the desired/generated dataframe:

  Company      Depart      Return  Charges0     ABC  01/01/2020
02/20/2020    60.671     HIJ  01/01/2020  03/01/2020    40.002     HIJ
 05/01/2020  05/05/2020    50.013     DEF  01/12/2020  02/02/2020
212.18

Thank you in advance!

Aaron
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/pandas-dev/attachments/20200601/4e1f8d19/attachment-0001.html>


More information about the Pandas-dev mailing list