pandas: mark rows between events
Creating the data:
import datetime
import pandas as pd

actions = ['Start','Action1','Action2','Pause','Action2','Resume','Action1','Finish','Start','Action1','Finish']
start_date = datetime.datetime.strptime('14/10/21 09:00:00', '%d/%m/%y %H:%M:%S')
# One event per second, starting at 09:00:00
date_list = [start_date + datetime.timedelta(seconds=x) for x in range(0,11)]
values = [1,1,2,1,2,1,5,1,1,1,1]
df = pd.DataFrame({'ActionType': actions,
                   'Timestamp': date_list,
                   'Value': values})
ActionType | Timestamp | Value |
---|---|---|
Start | 2021-10-14 09:00:00 | 1 |
Action1 | 2021-10-14 09:00:01 | 1 |
Action2 | 2021-10-14 09:00:02 | 2 |
Pause | 2021-10-14 09:00:03 | 1 |
Action2 | 2021-10-14 09:00:04 | 2 |
Resume | 2021-10-14 09:00:05 | 1 |
Action1 | 2021-10-14 09:00:06 | 5 |
Finish | 2021-10-14 09:00:07 | 1 |
Start | 2021-10-14 09:00:08 | 1 |
Action1 | 2021-10-14 09:00:09 | 1 |
Finish | 2021-10-14 09:00:10 | 1 |
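Notice that there are two "sessions" going on. I want to label each session in a new column.
- How do I get the rows between a Start row and a Finish row? (Assume the data is sorted.)
- Likewise, how do I filter out the pauses within a session? For example, to compute a RealTimeElapsed column (or simply to create a DuringPause boolean column).
The output should look like this: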
output = pd.DataFrame({'Session': [0,0,0,0,0,0,0,0,1,1,1],
                       'ActionType': actions,
                       'Timestamp': date_list,
                       'RealTimeElapsed': [0,1,2,3,3,3,4,5,0,1,2],
                       'Value': values
                       })
Session | ActionType | Timestamp | RealTimeElapsed | Value |
---|---|---|---|---|
0 | Start | 2021-10-14 09:00:00 | 0 | 1 |
0 | Action1 | 2021-10-14 09:00:01 | 1 | 1 |
0 | Action2 | 2021-10-14 09:00:02 | 2 | 2 |
0 | Pause | 2021-10-14 09:00:03 | 3 | 1 |
0 | Action2 | 2021-10-14 09:00:04 | 3 | 2 |
0 | Resume | 2021-10-14 09:00:05 | 3 | 1 |
0 | Action1 | 2021-10-14 09:00:06 | 4 | 5 |
0 | Finish | 2021-10-14 09:00:07 | 5 | 1 |
1 | Start | 2021-10-14 09:00:08 | 0 | 1 |
1 | Action1 | 2021-10-14 09:00:09 | 1 | 1 |
1 | Finish | 2021-10-14 09:00:10 | 2 | 1 |
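Already considered:
- Looping: this is bad practice (my data is very large), but please tell me if it is the only viable solution.
- Shift: pandas has a shift function, but I only know how to use it with a fixed number of rows, not with conditions (such as start/finish/pause/resume) that I could treat differently.
- Groupby('Type').unstack() and taking the difference between times: I can't do this because I need to keep the Value column.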
My solution is:
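import numpy as np

# Label each session: every "Start" row increments the running count.
df.insert(0, "Session", np.where(df["ActionType"] == "Start", 1, 0).cumsum() - 1)

def ufunc(df):
    # Build the elapsed counter for one session, segment by segment.
    a = list()
    # A new segment begins at every Start, Pause, or Resume row.
    for i, j in df.groupby(df.ActionType.isin(["Start", "Pause", "Resume"]).cumsum()):
        if j.ActionType.iloc[0] == "Start":
            a.extend(np.array(range(len(j))))            # count up from 0
        elif j.ActionType.iloc[0] == "Pause":
            a.extend([max(a) + 1] * len(j))              # advance once at Pause, then hold
        elif j.ActionType.iloc[0] == "Resume":
            a.extend(np.array(range(len(j))) + max(a))   # continue from the held value
    return a

df.insert(len(df.columns) - 1, "RealTimeElapsed", df.groupby("Session").apply(ufunc).explode().values)
df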
Session ActionType Timestamp RealTimeElapsed Value
0 0 Start 2021-10-14 09:00:00 0 1
1 0 Action1 2021-10-14 09:00:01 1 1
2 0 Action2 2021-10-14 09:00:02 2 2
3 0 Pause 2021-10-14 09:00:03 3 1
4 0 Action2 2021-10-14 09:00:04 3 2
5 0 Resume 2021-10-14 09:00:05 3 1
6 0 Action1 2021-10-14 09:00:06 4 5
7 0 Finish 2021-10-14 09:00:07 5 1
8 1 Start 2021-10-14 09:00:08 0 1
9 1 Action1 2021-10-14 09:00:09 1 1
10 1 Finish 2021-10-14 09:00:10 2 1
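Since the data is described as very large, a fully vectorized variant may be worth trying. The following is a sketch rather than a drop-in replacement: it derives RealTimeElapsed from the Timestamp column (in seconds) instead of counting rows, and it assumes the data is sorted, every session begins with a Start row, and Pause/Resume rows are properly paired. The helper names `session`, `paused`, `step` and `prev_paused` are illustrative.

```python
import pandas as pd

actions = ["Start", "Action1", "Action2", "Pause", "Action2", "Resume",
           "Action1", "Finish", "Start", "Action1", "Finish"]
# One event per second, matching the sample data.
timestamps = pd.Timestamp("2021-10-14 09:00:00") + pd.to_timedelta(list(range(len(actions))), unit="s")
df = pd.DataFrame({"ActionType": actions, "Timestamp": timestamps})

# Session id: running count of Start rows.
session = df["ActionType"].eq("Start").cumsum() - 1

# True from a Pause row up to (but not including) the matching Resume row.
pause_level = df["ActionType"].eq("Pause").astype(int) - df["ActionType"].eq("Resume").astype(int)
paused = pause_level.groupby(session).cumsum().gt(0)

# Seconds since the previous row of the same session (0 for each Start row).
step = df.groupby(session)["Timestamp"].diff().dt.total_seconds().fillna(0)

# An interval counts only if the row that opens it was not paused.
prev_paused = paused.groupby(session).shift(fill_value=False).astype(bool)

df.insert(0, "Session", session)
df["RealTimeElapsed"] = step.where(~prev_paused, 0).groupby(session).cumsum().astype(int)
```

On the sample data this gives the same RealTimeElapsed values as the loop-based function, because the events are exactly one second apart; with irregular timestamps the two approaches would differ, since this one measures time rather than row counts.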