Calculate the difference to the last date in days grouped by customer
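I have a problem. I want to get the difference to the last (previous) date. For example, from 2021-03-22 back to the previous date (2021-03-18) is 4 days. For each customerId, I want to calculate the difference in days between the row's date and the previous date, so the whole calculation should be done per customer. For a customer's earliest date the result should be None, because there is no earlier date.
DataFrame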
customerId fromDate otherInformation
0 1 2021-02-22 Cat
1 1 2021-03-18 Dog
2 1 2021-03-22 Cat
3 1 2021-02-10 Cat
4 1 2021-09-07 Cat
5 1 None Elefant
6 1 2022-01-18 Fish
7 2 2021-05-17 Fish
Code
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
                  '2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17'],
     'otherInformation': ['Cat', 'Dog', 'Cat', 'Cat', 'Cat', 'Elefant', 'Fish', 'Fish']
     }
df = pd.DataFrame(data=d)
print(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
# sort ascending by both columns so that missing dates are added correctly
df = df.sort_values(['customerId','fromDate'])
df = df.drop_duplicates(subset=['customerId', 'fromDate'], keep='first')
# new column: time elapsed between fromDate and today
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#print(True in df.index.duplicated())
# drop rows without a date, then add the missing calendar days per customerId
df = (df.dropna(subset=['fromDate'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())
print(df)
What I have
customerId fromDate lastInteractivity
0 1 2021-02-10 477 days
1 1 2021-02-11 NaT
2 1 2021-02-12 NaT
3 1 2021-02-13 NaT
4 1 2021-02-14 NaT
.. ... ... ...
339 1 2022-01-15 NaT
340 1 2022-01-16 NaT
341 1 2022-01-17 NaT
342 1 2022-01-18 135 days
343 2 2021-05-17 381 days
[344 rows x 3 columns]
What I want
customerId fromDate otherInformation lastInDays
0 1 2021-02-22 Cat 12 #last date 2021-02-10
1 1 2021-03-18 Dog 24 #last date 2021-02-22
2 1 2021-03-22 Cat 4 #last date 2021-03-18
3 1 2021-02-10 Cat None #last date not found
4 1 2021-09-07 Cat 169 #last date 2021-03-22
5 1 None Elefant None #was None
6 1 2022-01-18 Fish 133 #last date 2021-09-07
7 2 2021-05-17 Fish None #last date not found
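Sort the dataframe by the date column, then group by customerId and shift the date column, then subtract the shifted column from the original date column to get the difference in days: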
df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').groupby('customerId')['fromDate'].shift()
customerId fromDate otherInformation lastindays
0 1 2021-02-22 Cat 12 days
1 1 2021-03-18 Dog 24 days
2 1 2021-03-22 Cat 4 days
3 1 2021-02-10 Cat NaT
4 1 2021-09-07 Cat 169 days
5 1 NaT Elefant NaT
6 1 2022-01-18 Fish 133 days
7 2 2021-05-17 Fish NaT
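The result above is a Timedelta column ("12 days"), while the question asks for plain numbers in lastInDays. A minimal sketch of one way to get there, assuming whole-day integers with missing values are what is wanted: take `.dt.days` and cast to pandas' nullable `Int64` dtype. The intermediate name `prev_date` is only illustrative.

```python
import pandas as pd

d = {
    'customerId': [1, 1, 1, 1, 1, 1, 1, 2],
    'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
                 '2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17'],
    'otherInformation': ['Cat', 'Dog', 'Cat', 'Cat', 'Cat', 'Elefant', 'Fish', 'Fish'],
}
df = pd.DataFrame(data=d)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')

# Previous date per customer: sort by date, then shift within each customer.
# The shifted Series keeps the original row labels, so it aligns back onto
# the unsorted df by index when subtracted.
prev_date = df.sort_values('fromDate').groupby('customerId')['fromDate'].shift()

# .dt.days converts each Timedelta to a number of days (NaN where the
# difference is NaT); 'Int64' (illustrative choice) keeps whole numbers
# next to missing values instead of falling back to floats.
df['lastInDays'] = (df['fromDate'] - prev_date).dt.days.astype('Int64')
print(df)
```

The original row order is kept because only the helper Series is computed on the sorted frame; df itself is never reordered.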