Calculating the days difference between the previous visit time and the current visit time for the same customer (same group) in Pandas

I am trying to calculate the time difference (in days) between a customer's previous visit and that customer's latest visit.

time difference = latest in time - previous out time

Here is a sample of the input data:

Sample output table:

What I have tried so far is a groupby on the customer ID, plus a rank:

temp['RANK'] = temp.groupby('customer ID')['in time'].rank(ascending=True)

But I am not sure how to calculate the difference.

You can try the following:

temp.groupby('customer ID').apply(lambda x: (x['in time'].max() - x['out time'].min()).days )
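Note that this expression returns one number per customer (latest in time minus earliest out time), not one value per visit, so it only matches a per-row result when each customer has exactly two visits. A minimal sketch of the same computation without apply, assuming the sample rows shown in this thread and a day-first date format (the names grouped and per_customer are made up for illustration):

```python
import pandas as pd

# Sample rows taken from the input shown in this thread.
temp = pd.DataFrame({
    "customer ID": [1, 1, 2, 2, 3, 3],
    "out time": ["05-12-1999 15:20:07", "21-12-1999 09:59:34",
                 "05-12-1999 11:53:34", "08-12-1999 19:55:00",
                 "01-12-1999 15:15:26", "16-12-1999 17:10:09"],
    "in time": ["05-12-1999 14:23:31", "21-12-1999 09:41:09",
                "05-12-1999 11:05:37", "08-12-1999 19:40:10",
                "01-12-1999 13:08:11", "16-12-1999 16:34:10"],
})

# Assumption: dates are day-month-year, as in the sample.
fmt = "%d-%m-%Y %H:%M:%S"
temp["out time"] = pd.to_datetime(temp["out time"], format=fmt)
temp["in time"] = pd.to_datetime(temp["in time"], format=fmt)

# Latest in time minus earliest out time, one value per customer.
grouped = temp.groupby("customer ID")
per_customer = (grouped["in time"].max() - grouped["out time"].min()).dt.days
print(per_customer)
```

With more than two visits per customer, this would span the whole history rather than consecutive visits.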

You can use GroupBy.shift() to get the previous out time within each group and subtract it from the current in time. Then use dt.days to get the number of whole days in the resulting timedelta, as follows:

import pandas as pd

# convert date strings to datetime format
df['out time'] = pd.to_datetime(df['out time'], dayfirst=True)
df['in time'] = pd.to_datetime(df['in time'], dayfirst=True)

df['Visit diff (in days)'] = (df['in time'] - df['out time'].groupby(df['customer ID']).shift()).dt.days

Input data:

print(df)

   customer ID             out time              in time
0            1  05-12-1999 15:20:07  05-12-1999 14:23:31
1            1  21-12-1999 09:59:34  21-12-1999 09:41:09
2            2  05-12-1999 11:53:34  05-12-1999 11:05:37
3            2  08-12-1999 19:55:00  08-12-1999 19:40:10
4            3  01-12-1999 15:15:26  01-12-1999 13:08:11
5            3  16-12-1999 17:10:09  16-12-1999 16:34:10

Result:

print(df)

   customer ID            out time             in time  Visit diff (in days)
0            1 1999-12-05 15:20:07 1999-12-05 14:23:31                   NaN
1            1 1999-12-21 09:59:34 1999-12-21 09:41:09                  15.0
2            2 1999-12-05 11:53:34 1999-12-05 11:05:37                   NaN
3            2 1999-12-08 19:55:00 1999-12-08 19:40:10                   3.0
4            3 1999-12-01 15:15:26 1999-12-01 13:08:11                   NaN
5            3 1999-12-16 17:10:09 1999-12-16 16:34:10                  15.0
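For anyone who wants to reproduce this end to end, here is a self-contained sketch of the shift approach. Two things in it are additions rather than part of the answer: the rows are sorted by customer and in time first, on the assumption that the real data may not arrive in chronological order, and a second, hypothetical column shows the gap as fractional days, since dt.days drops the partial day.

```python
import pandas as pd

# Sample rows taken from the input shown above.
df = pd.DataFrame({
    "customer ID": [1, 1, 2, 2, 3, 3],
    "out time": ["05-12-1999 15:20:07", "21-12-1999 09:59:34",
                 "05-12-1999 11:53:34", "08-12-1999 19:55:00",
                 "01-12-1999 15:15:26", "16-12-1999 17:10:09"],
    "in time": ["05-12-1999 14:23:31", "21-12-1999 09:41:09",
                "05-12-1999 11:05:37", "08-12-1999 19:40:10",
                "01-12-1999 13:08:11", "16-12-1999 16:34:10"],
})

# Assumption: dates are day-month-year, as in the sample.
fmt = "%d-%m-%Y %H:%M:%S"
df["out time"] = pd.to_datetime(df["out time"], format=fmt)
df["in time"] = pd.to_datetime(df["in time"], format=fmt)

# shift() relies on row order, so sort visits chronologically per customer.
df = df.sort_values(["customer ID", "in time"]).reset_index(drop=True)

# Current in time minus the previous out time of the same customer.
gap = df["in time"] - df.groupby("customer ID")["out time"].shift()

df["Visit diff (in days)"] = gap.dt.days  # whole days only
df["Visit diff (fractional days)"] = gap.dt.total_seconds() / 86400
print(df)
```

The first visit of each customer has no previous out time, so both columns are NaN on those rows.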