Faster way to update dataframe rows after group by

Example input:


    GAME_ID     TIME    PER EVENT
0   0022000394  12:00   1   12
1   0022000394  12:00   1   10
2   0022000394  11:36   1   1
3   0022000394  11:24   1   1
4   0022000394  11:04   1   1
5   0022000394  10:41   1   1
6   0022000394  10:30   1   2
7   0022000394  10:29   1   4
8   0022000394  10:17   1   1
9   0022000394  10:01   1   1
10  0022000394  9:48    1   2
11  0022000394  9:46    1   4
12  0022000394  9:42    1   6
13  0022000394  9:42    1   3
14  0022000394  9:42    1   3
15  0022000394  9:25    1   1
16  0022000394  9:15    1   1
17  0022000394  9:15    1   6
18  0022000394  9:15    1   3
19  0022000394  8:53    1   1
20  0022000394  8:33    1   1
21  0022000394  8:22    1   1
22  0022000394  8:16    1   2
23  0022000394  8:16    1   4
24  0022000394  8:12    1   2

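I have a dataframe that I group with groupby to get groups of rows.

If a group contains 3 rows whose EVENTMSGTYPE column (EVENT in the sample above) contains all of [1, 6, 3], I want to update the row in the original dataframe where EVENTMSGTYPE == 1.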

Current working solution (slow):

# Group by
for _, data in df.groupby(['GAME_ID', 'TIME', 'PER']):

    # If EVENT in group contains 1, 6, and 3 then update original df
    if all(x in list(data.EVENT) for x in [1, 6, 3]):

        # Update original df row where EVENT equals 1, should only have one value
        index = data[data.EVENT == 1].index.values[0]

        # Set UPDATED to True
        df.at[index, 'UPDATED'] = True

Expected output:

    GAME_ID     TIME    PER EVENT UPDATED
...
16  0022000394  9:15    1   1     True
...

My dataframe has 1,694,389 rows, and this takes about 53 seconds to run on my machine. Can the performance be improved?

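idx_cols = ['GAME_ID', 'TIME', 'PER']

df = df.set_index(idx_cols)

# Per-group flag: True when the group's EVENT values include 1, 6 and 3.
# Use .values here: `x in series` tests the index labels, not the values.
cond1 = (
    df.groupby(level=idx_cols)['EVENT']
      .agg(lambda event_group: all(x in event_group.values for x in [1, 6, 3]))
      .reindex_like(df)
)

# Row-level flag: the row itself is an EVENT == 1 row
cond2 = df['EVENT'].eq(1)

df['UPDATED'] = cond1 & cond2

df = df.reset_index()
print(df)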

Output:

     GAME_ID   TIME  PER  EVENT  UPDATED
0   22000394  12:00    1     12    False
1   22000394  12:00    1     10    False
2   22000394  11:36    1      1    False
3   22000394  11:24    1      1    False
4   22000394  11:04    1      1    False
5   22000394  10:41    1      1    False
6   22000394  10:30    1      2    False
7   22000394  10:29    1      4    False
8   22000394  10:17    1      1    False
9   22000394  10:01    1      1    False
10  22000394   9:48    1      2    False
11  22000394   9:46    1      4    False
12  22000394   9:42    1      6    False
13  22000394   9:42    1      3    False
14  22000394   9:42    1      3    False
15  22000394   9:25    1      1    False
16  22000394   9:15    1      1     True
17  22000394   9:15    1      6    False
18  22000394   9:15    1      3    False
19  22000394   8:53    1      1    False
20  22000394   8:33    1      1    False
21  22000394   8:22    1      1    False
22  22000394   8:16    1      2    False
23  22000394   8:16    1      4    False
24  22000394   8:12    1      2    False
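
A note on membership tests inside a group lambda like the one above: `x in series` checks the index labels of a pandas Series, not its values. The small sketch below uses a made-up group (the index labels 16 to 18 are only for illustration) to show the difference and two ways to test the values instead.

```python
import pandas as pd

# Hypothetical group: EVENT values 1, 6, 3 stored under index labels 16, 17, 18
group = pd.Series([1, 6, 3], index=[16, 17, 18], name="EVENT")

# `in` on a Series looks at the index labels
print(1 in group)    # False: 1 is not an index label
print(16 in group)   # True: 16 is an index label

# Test the values explicitly instead
print(1 in group.values)         # True
print({1, 6, 3} <= set(group))   # True: every required value is present
```

Either form works inside `agg`, `filter` or `transform`; the set comparison also reads closer to the requirement in the question.
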

df['UPDATED'] = (
    df.groupby(['GAME_ID', 'TIME', 'PER'])['EVENT']
      .filter(lambda x: set(x) >= {1, 3, 6}, dropna=False)
      .eq(1)
)

Output:

     GAME_ID   TIME  PER  EVENT  UPDATED
0   22000394  12:00    1     12    False
1   22000394  12:00    1     10    False
2   22000394  11:36    1      1    False
3   22000394  11:24    1      1    False
4   22000394  11:04    1      1    False
5   22000394  10:41    1      1    False
6   22000394  10:30    1      2    False
7   22000394  10:29    1      4    False
8   22000394  10:17    1      1    False
9   22000394  10:01    1      1    False
10  22000394   9:48    1      2    False
11  22000394   9:46    1      4    False
12  22000394   9:42    1      6    False
13  22000394   9:42    1      3    False
14  22000394   9:42    1      3    False
15  22000394   9:25    1      1    False
16  22000394   9:15    1      1     True
17  22000394   9:15    1      6    False
18  22000394   9:15    1      3    False
19  22000394   8:53    1      1    False
20  22000394   8:33    1      1    False
21  22000394   8:22    1      1    False
22  22000394   8:16    1      2    False
23  22000394   8:16    1      4    False
24  22000394   8:12    1      2    False

I borrowed the set logic from sammywemmy.
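
For readers unfamiliar with the two pieces used here, a minimal sketch on a made-up two-group frame (only TIME is used as the key, to keep it short): `set(x) >= {1, 3, 6}` is a superset test, and `filter(..., dropna=False)` keeps the original length, leaving NaN in rows whose group fails the test, so `.eq(1)` is False there.

```python
import pandas as pd

# Hypothetical two-group frame; column names follow the question
df = pd.DataFrame({
    "TIME":  ["9:42", "9:42", "9:42", "9:15", "9:15", "9:15"],
    "EVENT": [6, 3, 3, 1, 6, 3],
})

required = {1, 3, 6}

# Superset test: duplicates in the group do not matter
print(set([6, 3, 3]) >= required)   # False, 1 is missing
print(set([1, 6, 3]) >= required)   # True

kept = df.groupby("TIME")["EVENT"].filter(
    lambda s: set(s) >= required, dropna=False
)
# kept has the same length as df; rows of failing groups are NaN
updated = kept.eq(1)
print(updated.tolist())   # [False, False, False, True, False, False]
```
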

One option is to use transform with a set; speed-wise, I expect Bert2ME's solution to be faster:

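# grouper is assumed to be the three grouping columns from the question
grouper = ['GAME_ID', 'TIME', 'PER']

df.assign(UPDATED = df.groupby(grouper)
                      .EVENT
                      .transform(lambda x: set(x) >= {1,3,6})
                                           & df.EVENT.eq(1))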

     GAME_ID   TIME  PER  EVENT  UPDATED
0   22000394  12:00    1     12    False
1   22000394  12:00    1     10    False
2   22000394  11:36    1      1    False
3   22000394  11:24    1      1    False
4   22000394  11:04    1      1    False
5   22000394  10:41    1      1    False
6   22000394  10:30    1      2    False
7   22000394  10:29    1      4    False
8   22000394  10:17    1      1    False
9   22000394  10:01    1      1    False
10  22000394   9:48    1      2    False
11  22000394   9:46    1      4    False
12  22000394   9:42    1      6    False
13  22000394   9:42    1      3    False
14  22000394   9:42    1      3    False
15  22000394   9:25    1      1    False
16  22000394   9:15    1      1     True
17  22000394   9:15    1      6    False
18  22000394   9:15    1      3    False
19  22000394   8:53    1      1    False
20  22000394   8:33    1      1    False
21  22000394   8:22    1      1    False
22  22000394   8:16    1      2    False
23  22000394   8:16    1      4    False
24  22000394   8:12    1      2    False
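
All of the answers above still call a Python lambda once per group. As a possible alternative (not taken from any of the answers, and not timed here), the sketch below computes one "is this value present in the group" flag per required value with the built-in `transform("any")`, then combines the flags. The helper name `flag_updated` and its parameters are hypothetical. Like the answers above, and unlike the original loop, it flags every EVENT == 1 row in a qualifying group rather than only the first one.

```python
import pandas as pd

def flag_updated(df, keys=("GAME_ID", "TIME", "PER"), required=(1, 6, 3), target=1):
    """Hypothetical helper: True on rows where EVENT == target and the row's
    group contains every value in `required`."""
    group_keys = [df[k] for k in keys]
    has_all = pd.Series(True, index=df.index)
    for value in required:
        # Row-wise comparison, then a built-in group-wise "any" broadcast to rows
        present = df["EVENT"].eq(value).groupby(group_keys).transform("any")
        has_all &= present
    return has_all & df["EVENT"].eq(target)

# Rows 12-18 of the sample input
sample = pd.DataFrame({
    "GAME_ID": ["0022000394"] * 7,
    "TIME": ["9:42", "9:42", "9:42", "9:25", "9:15", "9:15", "9:15"],
    "PER": [1] * 7,
    "EVENT": [6, 3, 3, 1, 1, 6, 3],
}, index=range(12, 19))

sample["UPDATED"] = flag_updated(sample)
print(sample)
```

On this slice only row 16 should come out True. Whether this beats the lambda-based versions on 1.7 million rows depends on the number of groups and would need to be measured on the real data.
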