计算按客户分组的天数与最后日期的差异

Calculate the difference to the last date in days grouped by customer

我有问题。我想得到最后日期的差异。例如 2021-03-22 到下一个日期 (2021-03-18) 是 4 天。我想计算 customerId 的行日期和最后日期之间的天数差异。所以完整的计算应该针对每个客户。最后一个日期应该是 None 因为我没有更早的日期。

数据框

   customerId    fromDate otherInformation
0           1  2021-02-22              Cat
1           1  2021-03-18              Dog
2           1  2021-03-22              Cat
3           1  2021-02-10              Cat
4           1  2021-09-07              Cat
5           1        None          Elefant
6           1  2022-01-18             Fish
7           2  2021-05-17             Fish

代码

import pandas as pd


d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22', 
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17'],
     'otherInformation': ['Cat', 'Dog', 'Cat', 'Cat', 'Cat', 'Elefant', 'Fish', 'Fish']
    }
df = pd.DataFrame(data=d)
print(df)

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#for correct add missing dates is sorting ascending by both columns
df = df.sort_values(['customerId','fromDate'])
df = df.drop_duplicates(subset=['customerId', 'fromDate'], keep='first')
#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#print(True in df.index.duplicated())
#added missing dates per customerId, also count removed missing rows with NaNs
df = (df.dropna(subset=['fromDate'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())

print(df)

我有什么

     customerId   fromDate lastInteractivity
0             1 2021-02-10          477 days
1             1 2021-02-11               NaT
2             1 2021-02-12               NaT
3             1 2021-02-13               NaT
4             1 2021-02-14               NaT
..          ...        ...               ...
339           1 2022-01-15               NaT
340           1 2022-01-16               NaT
341           1 2022-01-17               NaT
342           1 2022-01-18          135 days
343           2 2021-05-17          381 days

[344 rows x 3 columns]

我想要的

   customerId    fromDate otherInformation    lastInDays
0           1  2021-02-22              Cat          12 #last date 2021-02-10         
1           1  2021-03-18              Dog          36 #last date 2021-02-22
2           1  2021-03-22              Cat           4 #last date 2021-03-18
3           1  2021-02-10              Cat        None #last date not found
4           1  2021-09-07              Cat         169 #last date 2021-03-22
5           1        None          Elefant        None #was None
6           1  2022-01-18             Fish         133 #last date 2021-09-07
7           2  2021-05-17             Fish        None #last date not found

Sort 按日期列的数据框,然后按 customerIdshift 日期列分组,然后从原始日期列中减去它以获得天数差异

df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').groupby('customerId')['fromDate'].shift()

   customerId   fromDate otherInformation lastindays
0           1 2021-02-22              Cat    12 days
1           1 2021-03-18              Dog    24 days
2           1 2021-03-22              Cat     4 days
3           1 2021-02-10              Cat        NaT
4           1 2021-09-07              Cat   169 days
5           1        NaT          Elefant        NaT
6           1 2022-01-18             Fish   133 days
7           2 2021-05-17             Fish        NaT