[Pandas-dev] Create an aggregate/summary pandas dataframe based on overlapping dates derived from a more specific dataframe?
Aaron
aaron.christensen at gmail.com
Mon Jun 1 10:02:06 EDT 2020
Hello,
Given a dateframe with trips made my employees of different companies, I am
trying to generate a new dataframe with only the company names. There are
trips made by employees from different companies. I am looking to combine
the overlapping travel times from employees of the SAME company into a
single row. If there are no overlapping travel times, then that row just
transfers over as-is. When there are overlapping travel times, then the
following will happen:
--The name field is removed b/c that is no longer relevant (company name
stays), the Depart date will be the earliest date of any of the trip dates
regardless of the employee, the Return date will be the latest date of any
of the trip dates regardless of the employee, the charges for the trip will
be summed
For example, if trips had dates 01/01/20 - 01/31/20, 01/15/20 - 02/15/20,
02/01-20 - 02/28/20, then all three would be combined. The starting date
will be 1/1/20 and ending as of 2/28/20 Basically, the company was on that
trip from start to finish… kinda like a relay run handing off the baton.
Also, the charges will be summed for each of those trips and transferred
over to the single row.
I was playing around with timedelta, hierarchal indices, grouping, and
sorting but had a really hard time since I am looking at date ranges
instead of specific dates.
Here is the starting dataframe code/output:
import pandas as pd
emp_trips = {'Name': ['Bob','Joe','Sue','Jack', 'Henry', 'Frank',
'Lee', 'Jack'],
'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
'Depart' : ['01/01/2020', '01/01/2020', '01/06/2020',
'01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
'Return' : ['01/31/2020', '02/15/2020', '02/20/2020',
'03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]
}
df = pd.DataFrame(emp_trips, columns = ['Name', 'Company', 'Depart',
'Return', 'Charges'])
# Convert to date format
df['Return']= pd.to_datetime(df['Return'])
df['Depart']= pd.to_datetime(df['Depart'])
Name Company Depart Return Charges0 Bob ABC
2020-01-01 2020-01-31 10.101 Joe ABC 2020-01-01 2020-02-15
20.252 Sue ABC 2020-01-06 2020-02-20 30.323 Jack HIJ
2020-01-01 2020-03-01 40.004 Henry HIJ 2020-05-01 2020-05-05
50.015 Frank DEF 2020-01-13 2020-01-15 60.326 Lee DEF
2020-01-12 2020-01-30 70.997 Jack DEF 2020-01-14 2020-02-02
80.87
And, here is the desired/generated dataframe:
Company Depart Return Charges0 ABC 01/01/2020
02/20/2020 60.671 HIJ 01/01/2020 03/01/2020 40.002 HIJ
05/01/2020 05/05/2020 50.013 DEF 01/12/2020 02/02/2020
212.18
Thank you in advance!
Aaron
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/pandas-dev/attachments/20200601/4e1f8d19/attachment-0001.html>
More information about the Pandas-dev
mailing list