From an existing Pandas DataFrame, how can I create a summary DataFrame based on the union of overlapping date ranges (given a start and an end date) and an additional column?

Aaron aaron.christensen at gmail.com
Tue Jun 2 13:47:34 EDT 2020


Hello,

Given a dateframe with trips made by employees of different companies, I am
trying to generate a new dataframe with only the company names.  I am
looking to combine the overlapping travel times from employees of the SAME
company into a single row.  If there are no overlapping travel times, then
that row just transfers over as-is.  When there are overlapping travel
times, then the following will happen:

--The name field is removed b/c that is no longer relevant (company name
stays), the Depart date will be the earliest date of any of the trip dates
regardless of the employee, the Return date will be the latest date of any
of the trip dates regardless of the employee, the charges for the trip will
be summed

For example, if trips had dates 01/01/20 - 01/31/20, 01/15/20 - 02/15/20,
02/01-20 - 02/28/20, then all three would be combined.  The starting date
will be 1/1/20 and ending as of 2/28/20.  Basically, the company was on
that trip from start to finish… kinda like a relay run handing off the
baton.  Also, the charges will be summed for each of those trips and
transferred over to the single row.

Here is the starting dataframe code/output (note: the row order is
typically not already sorted by company name as in this example):

import pandas as pd


emp_trips = {'Name': ['Bob','Joe','Sue','Jack', 'Henry', 'Frank',
'Lee', 'Jack'],
        'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
        'Depart' : ['01/01/2020', '01/01/2020', '01/06/2020',
'01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
        'Return' : ['01/31/2020', '02/15/2020', '02/20/2020',
'03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
        'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]
        }

df = pd.DataFrame(emp_trips, columns = ['Name', 'Company', 'Depart',
'Return', 'Charges'])
# Convert to date format
df['Return']= pd.to_datetime(df['Return'])
df['Depart']= pd.to_datetime(df['Depart'])

  Name Company     Depart     Return  Charges0    Bob     ABC
2020-01-01 2020-01-31    10.101    Joe     ABC 2020-01-01 2020-02-15
 20.252    Sue     ABC 2020-01-06 2020-02-20    30.323   Jack     HIJ
2020-01-01 2020-03-01    40.004  Henry     HIJ 2020-05-01 2020-05-05
 50.015  Frank     DEF 2020-01-13 2020-01-15    60.326    Lee     DEF
2020-01-12 2020-01-30    70.997   Jack     DEF 2020-01-14 2020-02-02
 80.87

And, here is the desired/generated dataframe:

  Company      Depart      Return  Charges0     ABC  01/01/2020
02/20/2020    60.671     HIJ  01/01/2020  03/01/2020    40.002     HIJ
 05/01/2020  05/05/2020    50.013     DEF  01/12/2020  02/02/2020
212.18

I have been trying to use a combination of sorting and grouping but
the best I've achieved is reordering the dataframe.  Even though I am
able to sort/group based on values, I still run into the issues of
finding overlapping date ranges and pulling out all trips based on a
single company per aggregate/overlapping date range.

Thank you in advance for any help!

Aaron


More information about the Python-list mailing list