Return 每个 id 的持续时间
Return duration for each id
我有一大堆正在跟踪的事件,每个事件都附加了时间戳:
我目前有以下table:
ID Time_Stamp Event
1 2/20/2019 18:21 0
1 2/20/2019 19:46 0
1 2/21/2019 18:35 0
1 2/22/2019 11:39 1
1 2/22/2019 16:46 0
1 2/23/2019 7:40 0
2 6/5/2019 0:10 0
3 7/31/2019 10:18 0
3 8/23/2019 16:33 0
4 6/26/2019 20:49 0
我想要的是下面的[但不确定是否可以]:
ID Time_Stamp Conversion Total_Duration_Days Conversion_Duration
1 2/20/2019 18:21 0 2.555 1.721
1 2/20/2019 19:46 0 2.555 1.721
1 2/21/2019 18:35 0 2.555 1.721
1 2/22/2019 11:39 1 2.555 1.721
1 2/22/2019 16:46 1 2.555 1.934
1 2/23/2019 7:40 0 2.555 1.934
2 6/5/2019 0:10 0 1.00 0.000
3 7/31/2019 10:18 0 23.260 0.000
3 8/23/2019 16:33 0 23.260 0.000
4 6/26/2019 20:49 0 1.00 0.000
#1 总持续时间 = Max Date - Min Date
[2.555 天]
对于#2 转化持续时间 = Conversion Date - Min Date
[1.721 天] - 执行以下操作post 转化可以保持计算的持续时间
我尝试了以下操作:
df.reset_index(inplace=True)
df.groupby(['ID'])['Time_Stamp].diff().fillna(0)
这是我想要的,但它显示了每个事件之间的差异,而不是最小时间戳与最大时间戳
conv_test = df.reset_index(inplace=True)
min_df = conv_test.groupby(['ID'])['visitStartTime_aest'].agg('min').to_frame('MinTime')
max_df = conv_test.groupby(['ID'])['visitStartTime_aest'].agg('max').to_frame('MaxTime')
conv_test = conv_test.set_index('ID').merge(min_df, left_index=True, right_index=True)
conv_test = conv_test.merge(max_df, left_index=True, right_index=True)
conv_test['Durartion'] = conv_test['MaxTime'] - conv_test['MinTime']
这给了我 Total_Duration_Days
这很棒 [随时提供更优雅的解决方案
关于如何获得 Conversion_Duration
的任何想法?
您可以对新系列使用 GroupBy.transform
with min
and max
for Series
with same size like original, so possible subtract for Total_Duration_Days
and then filter only 1
rows by Event
, create Series
by DataFrame.set_index
and convert to dict
, then Series.map
,因此可以减去每组的最小值:
df['Time_Stamp'] = pd.to_datetime(df['Time_Stamp'])
min1 = df.groupby('ID')['Time_Stamp'].transform('min')
max1 = df.groupby('ID')['Time_Stamp'].transform('max')
df['Total_Duration_Days'] = max1.sub(min1).dt.total_seconds() / (3600 * 24)
d = df.loc[df['Event'] == 1].set_index('ID')['Time_Stamp'].to_dict()
new1 = df['ID'].map(d)
因为每个组可能有多个 1
仅针对此组添加解决方案 - 测试,如果掩码中每个组更多 1
,请获取系列 new2
,然后使用 Series.combine_first
与 mapped
系列 new1
。
原因是提高性能,因为处理倍数1有点复杂。
mask = df['Event'].eq(1).groupby(df['ID']).transform('sum').gt(1)
g = df[mask].groupby('ID')['Event'].cumsum().replace({0:np.nan})
new2 = (df[mask].groupby(['ID', g])['Time_Stamp']
.transform('first')
.groupby(df['ID'])
.bfill())
df['Conversion_Duration'] = (new2.combine_first(new1)
.sub(min1)
.dt.total_seconds().fillna(0) / (3600 * 24))
print (df)
ID Time_Stamp Event Total_Duration_Days Conversion_Duration
0 1 2019-02-20 18:21:00 0 2.554861 1.720833
1 1 2019-02-20 19:46:00 0 2.554861 1.720833
2 1 2019-02-21 18:35:00 0 2.554861 1.720833
3 1 2019-02-22 11:39:00 1 2.554861 1.720833
4 1 2019-02-22 16:46:00 1 2.554861 1.934028
5 1 2019-02-23 07:40:00 0 2.554861 1.934028
6 2 2019-06-05 00:10:00 0 0.000000 0.000000
7 3 2019-07-31 10:18:00 0 23.260417 0.000000
8 3 2019-08-23 16:33:00 0 23.260417 0.000000
9 4 2019-06-26 20:49:00 0 0.000000 0.000000
我有一大堆正在跟踪的事件,每个事件都附加了时间戳:
我目前有以下table:
ID Time_Stamp Event
1 2/20/2019 18:21 0
1 2/20/2019 19:46 0
1 2/21/2019 18:35 0
1 2/22/2019 11:39 1
1 2/22/2019 16:46 0
1 2/23/2019 7:40 0
2 6/5/2019 0:10 0
3 7/31/2019 10:18 0
3 8/23/2019 16:33 0
4 6/26/2019 20:49 0
我想要的是下面的[但不确定是否可以]:
ID Time_Stamp Conversion Total_Duration_Days Conversion_Duration
1 2/20/2019 18:21 0 2.555 1.721
1 2/20/2019 19:46 0 2.555 1.721
1 2/21/2019 18:35 0 2.555 1.721
1 2/22/2019 11:39 1 2.555 1.721
1 2/22/2019 16:46 1 2.555 1.934
1 2/23/2019 7:40 0 2.555 1.934
2 6/5/2019 0:10 0 1.00 0.000
3 7/31/2019 10:18 0 23.260 0.000
3 8/23/2019 16:33 0 23.260 0.000
4 6/26/2019 20:49 0 1.00 0.000
#1 总持续时间 = Max Date - Min Date
[2.555 天]
对于#2 转化持续时间 = Conversion Date - Min Date
[1.721 天] - 执行以下操作post 转化可以保持计算的持续时间
我尝试了以下操作:
df.reset_index(inplace=True)
df.groupby(['ID'])['Time_Stamp].diff().fillna(0)
这是我想要的,但它显示了每个事件之间的差异,而不是最小时间戳与最大时间戳
conv_test = df.reset_index(inplace=True)
min_df = conv_test.groupby(['ID'])['visitStartTime_aest'].agg('min').to_frame('MinTime')
max_df = conv_test.groupby(['ID'])['visitStartTime_aest'].agg('max').to_frame('MaxTime')
conv_test = conv_test.set_index('ID').merge(min_df, left_index=True, right_index=True)
conv_test = conv_test.merge(max_df, left_index=True, right_index=True)
conv_test['Durartion'] = conv_test['MaxTime'] - conv_test['MinTime']
这给了我 Total_Duration_Days
这很棒 [随时提供更优雅的解决方案
关于如何获得 Conversion_Duration
的任何想法?
您可以对新系列使用 GroupBy.transform
with min
and max
for Series
with same size like original, so possible subtract for Total_Duration_Days
and then filter only 1
rows by Event
, create Series
by DataFrame.set_index
and convert to dict
, then Series.map
,因此可以减去每组的最小值:
df['Time_Stamp'] = pd.to_datetime(df['Time_Stamp'])
min1 = df.groupby('ID')['Time_Stamp'].transform('min')
max1 = df.groupby('ID')['Time_Stamp'].transform('max')
df['Total_Duration_Days'] = max1.sub(min1).dt.total_seconds() / (3600 * 24)
d = df.loc[df['Event'] == 1].set_index('ID')['Time_Stamp'].to_dict()
new1 = df['ID'].map(d)
因为每个组可能有多个 1
仅针对此组添加解决方案 - 测试,如果掩码中每个组更多 1
,请获取系列 new2
,然后使用 Series.combine_first
与 mapped
系列 new1
。
原因是提高性能,因为处理倍数1有点复杂。
mask = df['Event'].eq(1).groupby(df['ID']).transform('sum').gt(1)
g = df[mask].groupby('ID')['Event'].cumsum().replace({0:np.nan})
new2 = (df[mask].groupby(['ID', g])['Time_Stamp']
.transform('first')
.groupby(df['ID'])
.bfill())
df['Conversion_Duration'] = (new2.combine_first(new1)
.sub(min1)
.dt.total_seconds().fillna(0) / (3600 * 24))
print (df)
ID Time_Stamp Event Total_Duration_Days Conversion_Duration
0 1 2019-02-20 18:21:00 0 2.554861 1.720833
1 1 2019-02-20 19:46:00 0 2.554861 1.720833
2 1 2019-02-21 18:35:00 0 2.554861 1.720833
3 1 2019-02-22 11:39:00 1 2.554861 1.720833
4 1 2019-02-22 16:46:00 1 2.554861 1.934028
5 1 2019-02-23 07:40:00 0 2.554861 1.934028
6 2 2019-06-05 00:10:00 0 0.000000 0.000000
7 3 2019-07-31 10:18:00 0 23.260417 0.000000
8 3 2019-08-23 16:33:00 0 23.260417 0.000000
9 4 2019-06-26 20:49:00 0 0.000000 0.000000