Faster way to update dataframe rows after group by
Sample input:
GAME_ID TIME PER EVENT
0 0022000394 12:00 1 12
1 0022000394 12:00 1 10
2 0022000394 11:36 1 1
3 0022000394 11:24 1 1
4 0022000394 11:04 1 1
5 0022000394 10:41 1 1
6 0022000394 10:30 1 2
7 0022000394 10:29 1 4
8 0022000394 10:17 1 1
9 0022000394 10:01 1 1
10 0022000394 9:48 1 2
11 0022000394 9:46 1 4
12 0022000394 9:42 1 6
13 0022000394 9:42 1 3
14 0022000394 9:42 1 3
15 0022000394 9:25 1 1
16 0022000394 9:15 1 1
17 0022000394 9:15 1 6
18 0022000394 9:15 1 3
19 0022000394 8:53 1 1
20 0022000394 8:33 1 1
21 0022000394 8:22 1 1
22 0022000394 8:16 1 2
23 0022000394 8:16 1 4
24 0022000394 8:12 1 2
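I have a dataframe that I group with groupby to get groups of rows.
If a group contains 3 rows whose EVENT column contains all of [1, 6, 3], I want to update the row in the original dataframe where EVENT == 1.
Current working solution (slow):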
# Group by
for _, data in df.groupby(['GAME_ID', 'TIME', 'PER']):
    # If EVENT in group contains 1, 6, and 3 then update original df
    if all(x in list(data.EVENT) for x in [1, 6, 3]):
        # Update original df row where EVENT equals 1, should only have one value
        index = data[data.EVENT == 1].index.values[0]
        # Set UPDATED to True
        df.at[index, 'UPDATED'] = True
Expected output:
GAME_ID TIME PER EVENT UPDATED
...
16 0022000394 9:15 1 1 True
...
My dataframe has 1,694,389 rows, and this takes about 53 seconds to run on my machine. Can its performance be improved?
idx_cols = ['GAME_ID', 'TIME', 'PER']
df = df.set_index(idx_cols)
cond1 = (
    df.groupby(level=idx_cols)['EVENT']
    # Test against the values: `in` on a Series checks index labels, not values
    .agg(lambda event_group: all(x in event_group.values for x in [1, 6, 3]))
    .reindex_like(df)
)
cond2 = df['EVENT'].eq(1)
df['UPDATED'] = cond1 & cond2
df = df.reset_index()
print(df)
Output:
GAME_ID TIME PER EVENT UPDATED
0 22000394 12:00 1 12 False
1 22000394 12:00 1 10 False
2 22000394 11:36 1 1 False
3 22000394 11:24 1 1 False
4 22000394 11:04 1 1 False
5 22000394 10:41 1 1 False
6 22000394 10:30 1 2 False
7 22000394 10:29 1 4 False
8 22000394 10:17 1 1 False
9 22000394 10:01 1 1 False
10 22000394 9:48 1 2 False
11 22000394 9:46 1 4 False
12 22000394 9:42 1 6 False
13 22000394 9:42 1 3 False
14 22000394 9:42 1 3 False
15 22000394 9:25 1 1 False
16 22000394 9:15 1 1 True
17 22000394 9:15 1 6 False
18 22000394 9:15 1 3 False
19 22000394 8:53 1 1 False
20 22000394 8:33 1 1 False
21 22000394 8:22 1 1 False
22 22000394 8:16 1 2 False
23 22000394 8:16 1 4 False
24 22000394 8:12 1 2 False
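The per-group lambda above still runs Python code once per group. As a rough sketch of a lambda-free variant (not taken from any of the answers here), each required code can be checked with a built-in `transform("any")` and the three results combined. The small frame below is an assumption modelled on rows 12 to 18 of the sample input, and no timing is claimed for it.

```python
import pandas as pd

# Small frame modelled on rows 12-18 of the sample input (illustrative only).
df = pd.DataFrame({
    "GAME_ID": ["0022000394"] * 7,
    "TIME": ["9:42", "9:42", "9:42", "9:25", "9:15", "9:15", "9:15"],
    "PER": [1] * 7,
    "EVENT": [6, 3, 3, 1, 1, 6, 3],
})

keys = [df["GAME_ID"], df["TIME"], df["PER"]]

# For each required code, flag every row whose group contains that code.
has_all = pd.Series(True, index=df.index)
for code in (1, 6, 3):
    has_all = has_all & df["EVENT"].eq(code).groupby(keys).transform("any")

# Mark only the EVENT == 1 rows inside groups that contain 1, 6 and 3.
df["UPDATED"] = has_all & df["EVENT"].eq(1)
print(df)
```

Only the 9:15 group holds all three codes, so only its EVENT == 1 row ends up flagged.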
df['UPDATED'] = df.groupby(['GAME_ID', 'TIME', 'PER'])['EVENT'].filter(lambda x: set(x) >= {1,3,6}, dropna=False).eq(1)
Output:
GAME_ID TIME PER EVENT UPDATED
0 22000394 12:00 1 12 False
1 22000394 12:00 1 10 False
2 22000394 11:36 1 1 False
3 22000394 11:24 1 1 False
4 22000394 11:04 1 1 False
5 22000394 10:41 1 1 False
6 22000394 10:30 1 2 False
7 22000394 10:29 1 4 False
8 22000394 10:17 1 1 False
9 22000394 10:01 1 1 False
10 22000394 9:48 1 2 False
11 22000394 9:46 1 4 False
12 22000394 9:42 1 6 False
13 22000394 9:42 1 3 False
14 22000394 9:42 1 3 False
15 22000394 9:25 1 1 False
16 22000394 9:15 1 1 True
17 22000394 9:15 1 6 False
18 22000394 9:15 1 3 False
19 22000394 8:53 1 1 False
20 22000394 8:33 1 1 False
21 22000394 8:22 1 1 False
22 22000394 8:16 1 2 False
23 22000394 8:16 1 4 False
24 22000394 8:12 1 2 False
Stole the set logic from sammywemmy~
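To see why the one-liner works, it may help to look at what `filter(..., dropna=False)` returns on a tiny Series. In the sketch below the `group` labels are hypothetical stand-ins for the three grouping columns.

```python
import pandas as pd

events = pd.Series([1, 6, 3, 1, 2], name="EVENT")
group = pd.Series(["a", "a", "a", "b", "b"], name="group")  # hypothetical keys

# dropna=False keeps the original shape: rows of groups that fail the
# predicate are replaced by NaN instead of being dropped.
kept = events.groupby(group).filter(lambda s: set(s) >= {1, 3, 6}, dropna=False)
print(kept)

# NaN never equals 1, so rows from failing groups come out False.
print(kept.eq(1).tolist())  # [True, False, False, False, False]
```

Because the filtered Series keeps the original index, the result can be assigned straight back as a new column.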
One option is to use set with transform; speed-wise I expect Bert2ME's solution to be faster:
grouper = ['GAME_ID', 'TIME', 'PER']
df.assign(UPDATED = df.groupby(grouper)
                      .EVENT
                      .transform(lambda x: set(x) >= {1,3,6})
                    & df.EVENT.eq(1))
GAME_ID TIME PER EVENT UPDATED
0 22000394 12:00 1 12 False
1 22000394 12:00 1 10 False
2 22000394 11:36 1 1 False
3 22000394 11:24 1 1 False
4 22000394 11:04 1 1 False
5 22000394 10:41 1 1 False
6 22000394 10:30 1 2 False
7 22000394 10:29 1 4 False
8 22000394 10:17 1 1 False
9 22000394 10:01 1 1 False
10 22000394 9:48 1 2 False
11 22000394 9:46 1 4 False
12 22000394 9:42 1 6 False
13 22000394 9:42 1 3 False
14 22000394 9:42 1 3 False
15 22000394 9:25 1 1 False
16 22000394 9:15 1 1 True
17 22000394 9:15 1 6 False
18 22000394 9:15 1 3 False
19 22000394 8:53 1 1 False
20 22000394 8:33 1 1 False
21 22000394 8:22 1 1 False
22 22000394 8:16 1 2 False
23 22000394 8:16 1 4 False
24 22000394 8:12 1 2 False
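One difference from the loop in the question may be worth checking: the loop updates only the first EVENT == 1 row of a qualifying group (`.index.values[0]`), while the vectorized expressions flag every EVENT == 1 row in it. The sample data never has two such rows in one group, so the outputs agree there. The sketch below uses a made-up group with two EVENT == 1 rows to show both behaviours; the names `every` and `first_only` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "GAME_ID": ["0022000394"] * 5,
    "TIME": ["9:15", "9:15", "9:15", "9:15", "8:53"],
    "PER": [1] * 5,
    "EVENT": [1, 1, 6, 3, 1],  # hypothetical: two EVENT == 1 rows at 9:15
})
grouper = ["GAME_ID", "TIME", "PER"]

has_all = (
    df.groupby(grouper)["EVENT"]
    .transform(lambda x: set(x) >= {1, 3, 6})
    .astype(bool)  # guard against the result being cast to integers
)
is_one = df["EVENT"].eq(1)

# Every EVENT == 1 row in a qualifying group (the vectorized answers).
every = has_all & is_one

# Only the first EVENT == 1 row per group (what the original loop marks).
rank = is_one.astype(int).groupby([df[c] for c in grouper]).cumsum()
first_only = every & rank.eq(1)

print(every.tolist())       # [True, True, False, False, False]
print(first_only.tolist())  # [True, False, False, False, False]
```

The loop also leaves untouched rows as missing values rather than False, so a direct comparison against it would need `fillna(False)` first.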