pandas 标记事件之间的行

Question

创建数据

actions = ['Start','Action1','Action2','Pause','Actoin2','Resume','Action1','Finish','Start','Action1','Finish']
start_date = datetime.datetime.strptime('14/10/21 09:00:00', '%d/%m/%y %H:%M:%S')
date_list = [start_date + datetime.timedelta(seconds=x) for x in range(0,11)]
values = [1,1,2,1,2,1,5,1,1,1,1]
df = pd.DataFrame({'ActionType': actions,
                   'Timestamp': date_list,
                   'Value': values})

ActionType	Timestamp	Value
Start	2021-10-14 09:00:00	1
Action1	2021-10-14 09:00:01	1
Action2	2021-10-14 09:00:02	2
Pause	2021-10-14 09:00:03	1
Action2	2021-10-14 09:00:04	2
Restart	2021-10-14 09:00:05	1
Action1	2021-10-14 09:00:06	5
Finish	2021-10-14 09:00:07	1
Start	2021-10-14 09:00:08	1
Action1	2021-10-14 09:00:09	1
Finish	2021-10-14 09:00:010	1

看看有两个“会话”是如何进行的。我想在新栏中标记每个会话。

如何获取开始行和结束行之间的行？（假设已排序）
同样，如何过滤掉会话中的暂停？例如，要计算 RealTimeElapsed 列（或简单地制作一个 DuringPause 布尔列）

输出应如下所示：

output = pd.DataFrame({'Session:': [0,0,0,0,0,0,0,0,1,1,1],
                       'ActionType': actions,
                       'Timestamp': date_list,
                       'RealTimeElapsed': [0,1,2,3,3,3,4,5,0,1,2],
                       'Value': values
                      })

Session	ActionType	Timestamp	RealTimeElapsed	Value
0	Start	2021-10-14 09:00:00	0	1
0	Action1	2021-10-14 09:00:01	1	1
0	Action2	2021-10-14 09:00:02	2	1
0	Pause	2021-10-14 09:00:03	3	1
0	Action2	2021-10-14 09:00:04	3	1
0	Resume	2021-10-14 09:00:05	3	1
0	Action1	2021-10-14 09:00:06	4	1
0	Finish	2021-10-14 09:00:07	5	1
1	Start	2021-10-14 09:00:08	0	1
1	Action1	2021-10-14 09:00:09	1	1
1	Finish	2021-10-14 09:00:010	2	1

已考虑：

循环：这是一个糟糕的做法（我的数据非常大），但如果这是唯一可行的解决方案，请告诉我。
Shift：pandas有shift功能，但是我只知道固定行数怎么用，没有条件的东西（比如start/finish/pause/resume）我可以区别对待
Groupby('Type').unstack() 并在时间之间取差：我不能这样做，因为我需要维护值列

Answer 1

我的解决方案是：

df.insert(0,"Session",np.where(df["ActionType"]=="Start",1,0).cumsum()-1)

def ufunc(df):
    a = list()
    for i,j in df.groupby(df.ActionType.isin(["Start","Pause","Resume"]).cumsum()):
        if j.ActionType.iloc[0] == "Start":
            a.extend(np.array(range(len(j))))
        elif j.ActionType.iloc[0] == "Pause":
            a.extend([max(a)+1] * len(j))
        elif j.ActionType.iloc[0] == "Resume":
            a.extend(np.array(range(len(j))) + max(a))
    return a

df.insert(len(df.columns)-1,"RealTimeElapsed",df.groupby("Session").apply(ufunc).explode().values)

df

    Session ActionType  Timestamp   RealTimeElapsed   Value
   0    0   Start    2021-10-14 09:00:00    0            1
   1    0   Action1  2021-10-14 09:00:01    1            1
   2    0   Action2  2021-10-14 09:00:02    2            2
   3    0   Pause    2021-10-14 09:00:03    3            1
   4    0   Actoin3  2021-10-14 09:00:04    3            2
   5    0   Resume   2021-10-14 09:00:05    3            1
   6    0   Action1  2021-10-14 09:00:06    4            5
   7    0   Finish   2021-10-14 09:00:07    5            1
   8    1   Start    2021-10-14 09:00:08    0            1
   9    1   Action1  2021-10-14 09:00:09    1            1
   10   1   Finish   2021-10-14 09:00:10    2            1

pandas 标记事件之间的行

pandas mark rows between events

time-series

categories

pandas