Group by and update based on condition python pandas

Question

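Hi, I have a pandas dataframe with an ID column and a value column.

I would like to use vectorized code to "group by" each ID and then run a function on every value in every row of the group. If the function returns False for every value of a particular ID, I want to remove that ID and all of its rows from the original dataframe.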

def funct1(num):
    num = num**.5
    if num - int(num) == 0:
        return True
    else:
        return False

import pandas as pd
df = {
   'ID':['1','1','1','1','1','2','2','2','2','2','3','3','3','3','3'],
   'Percentage':[7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]}
df = pd.DataFrame(df)

The end result should be:

df = {
   'ID':['1','1','1','1','1','2','2','2','2','2'],
   'Percentage':[7,8,9,10,11,12,13,14,15,16]}
df = pd.DataFrame(df)

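Here we see that every row for ID 3 was removed, because no value in that group has an integer square root.

I would also like to maintain the index, because in the real data the values are not ordered as they are in this example.

The answer should return a df and not a groupby object.

Thanks for the help!

Best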

Answer 1

In your case, use groupby with filter:

out = df.groupby('ID').filter(lambda x : any((x['Percentage']**0.5).map(float.is_integer)))
Out[317]: 
  ID  Percentage
0  1           7
1  1           8
2  1           9
3  1          10
4  1          11
5  2          12
6  2          13
7  2          14
8  2          15
9  2          16
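
If calling a Python lambda once per group becomes a bottleneck, one alternative is to compute a row-level boolean flag first and then broadcast "any True in this group" back to the rows with transform. The following is only a sketch under the question's sample data; the names is_square and keep are illustrative and do not come from the answer above. Because the result is a boolean mask applied to df, the original index labels are kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["1"] * 5 + ["2"] * 5 + ["3"] * 5,
    "Percentage": list(range(7, 22)),
})

# Row-level flag: True where the square root is a whole number.
root = np.sqrt(df["Percentage"])
is_square = root == np.floor(root)

# Broadcast "at least one True in this ID's group" back to every row.
keep = is_square.groupby(df["ID"]).transform("any")

# Boolean indexing returns a DataFrame and keeps the original index.
out = df[keep]
print(out)
```

Here 9 (ID 1) and 16 (ID 2) are the only perfect squares, so only the rows for ID 3 are dropped.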

Answer 2

You don't need groupby:

funct1 = lambda pct: pct.pow(0.5) - pct.pow(0.5).astype(int) == 0
out = df[df['ID'].isin(df.loc[funct1(df['Percentage']), 'ID'])]

>>> out
  ID  Percentage
0  1           7
1  1           8
2  1           9
3  1          10
4  1          11
5  2          12
6  2          13
7  2          14
8  2          15
9  2          16
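
Both answers test for a perfect square by taking a floating-point square root, which can misjudge very large integers. If that matters for the real data, an exact integer test could be swapped in. The sketch below assumes non-negative integer values; is_perfect_square is a hypothetical helper, and the shuffled rows with a non-default index are made up to show that the isin approach preserves index labels. Note that map calls the helper once per row, so this trades vectorization for exactness.

```python
import math
import pandas as pd

def is_perfect_square(n: int) -> bool:
    """Exact integer test (hypothetical helper); no floating-point rounding."""
    if n < 0:
        return False
    r = math.isqrt(n)
    return r * r == n

# Made-up, unordered data with a non-default index.
df = pd.DataFrame(
    {"ID": ["3", "1", "2", "1", "3", "2"],
     "Percentage": [17, 9, 12, 7, 20, 16]},
    index=[10, 11, 12, 13, 14, 15],
)

# Flag rows whose value is a perfect square, then keep every row whose ID
# appears among the flagged rows.
square_rows = df["Percentage"].map(is_perfect_square)
ids_to_keep = df.loc[square_rows, "ID"]
out = df[df["ID"].isin(ids_to_keep)]
print(out)
```

Only ID 3 has no perfect square here, so its rows (index 10 and 14) are removed and the remaining rows keep their original labels and order.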

Performance

# @BENY
%timeit df.groupby('ID').filter(lambda x : any((x['Percentage']**0.5).map(float.is_integer)))
1.08 ms ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# @Corralien
%timeit df[df['ID'].isin(df.loc[funct1(df['Percentage']), 'ID'])]
651 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
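
The timings above are for the 15-row example, so the gap may look different on larger data with many groups. A rough way to repeat the comparison outside IPython is sketched below; the synthetic data, the sizes, and the function names with_filter and with_isin are all assumptions made for illustration, and no timing results are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; the sizes are arbitrary and chosen only for illustration.
rng = np.random.default_rng(0)
n_rows, n_ids = 20_000, 2_000
big = pd.DataFrame({
    "ID": rng.integers(0, n_ids, n_rows).astype(str),
    "Percentage": rng.integers(1, 10_000, n_rows),
})

def with_filter(df):
    # groupby + filter, as in Answer 1
    return df.groupby("ID").filter(
        lambda g: any((g["Percentage"] ** 0.5).map(float.is_integer))
    )

def with_isin(df):
    # boolean mask + isin, as in Answer 2
    root = df["Percentage"].pow(0.5)
    mask = root - root.astype(int) == 0
    return df[df["ID"].isin(df.loc[mask, "ID"])]

# Check that both approaches select the same rows before timing them.
same_result = with_filter(big).equals(with_isin(big))

for fn in (with_filter, with_isin):
    seconds = timeit.timeit(lambda: fn(big), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms per call")
```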
