Pandas: group rows with the same column values, apply a function to the first row, then assign the result to the remaining rows

Question

I have a df; the code is:

import pandas as pd
from io import StringIO

data = """
  ValOption  RB test contrat
0       SLA  4  3    23
1       AC   5  4    12
2       SLA  5  5    23
3       AC   2  4    39
4       SLA  5  5    26
5       AC   5  4    52
6       SLA  4  3    64
"""
# the header has one field fewer than the rows, so the first column becomes the index
df = pd.read_csv(StringIO(data.strip()), sep=r'\s+')

Output:

  ValOption  RB  test  contrat
0       SLA   4     3       23
1        AC   5     4       12
2       SLA   5     5       23
3        AC   2     4       39
4       SLA   5     5       26
5        AC   5     4       52
6       SLA   4     3       64

Now I want to group together the rows that have the same 'ValOption' and 'RB':

df.sort_values(['ValOption', 'RB']).set_index(['ValOption', 'RB'])

Note: because there are many rows and the values of RB are dynamic, I can't use something like:

df.loc[(df['ValOption'] == 'AC') & (df['RB'] == 5)]

Now I want to apply a function to each row:

def func(row):
    v1 = row['RB'] * 3
    v2 = row['test'] - 1
    return v1 + v2

df['new_col'] = df.apply(func, axis=1)

Output:

  ValOption  RB  test  contrat  new_col
0       SLA   4     3       23       14
1        AC   5     4       12       18
2       SLA   5     5       23       19
3        AC   2     4       39        9
4       SLA   5     5       26       19
5        AC   5     4       52       18
6       SLA   4     3       64       14

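However, in my real use case the function is very large and extremely complicated, which is why I only use the pandas apply function to run it on each row. I know this approach is slower than the NumPy version: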

def func(df):
    v1 = df['RB'].values * 3
    v2 = df['test'].values - 1
    return v1 + v2

df['new_col'] = func(df)

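This code gives the same result and is faster, but my function is too complicated, so I can only use the pandas apply function. I have been trying for two weeks.

So my question is: after grouping the rows that have the same column values, how can I apply the function to each group but only to its first row, so that I don't repeat the computation? Because the variables are the same, I want to assign the first row's result directly to the other rows with the same column values.

The final goal is to save time when the program runs on large files.

What I have in mind is something like this (pseudocode):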

new_value_of_first_row=group['SLA'].each_group(group:RB==4 and group:RB==5).first_row.apply(lambda row: func(row),axis=1)


new_value_of_other_rows = new_value_of_first_row

Answer 1

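IIUC, you only want to apply a function to the first item of each group. You can build a mask (with groupby + cumcount) and then use where (or mask) to assign the function's output to the selected rows: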

df['RB_new'] = df['RB'].where(df.groupby(['ValOption', 'RB']).cumcount().ne(0),
                              df['RB']*3 # to replace with your (vector) function
                              )

Output (in a new column for clarity; the sample frame here differs from the one in the question):

  ValOption  RB  RB_new
0       SLA   4      12
1        AC   5      15
2       SLA   5      15
3        PG   5      15
4       SLA   5       5
5        PC   4      12
6       SLA   4       4
7        AC   5       5
8        PC   4       4
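
The mask above transforms only the first row of each group and leaves the other rows unchanged. If the goal is instead to reuse the first row's result for the whole group, one possible extension is sketched below. It assumes the question's sample data and row-wise func, and it assumes that every column func reads (here 'RB' and 'test') is part of the grouping keys; the names keys and first are illustrative only.

```python
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "ValOption": ["SLA", "AC", "SLA", "AC", "SLA", "AC", "SLA"],
    "RB": [4, 5, 5, 2, 5, 5, 4],
    "test": [3, 4, 5, 4, 5, 4, 3],
    "contrat": [23, 12, 23, 39, 26, 52, 64],
})

# Assumption: every column read by func must be one of the grouping keys,
# otherwise rows in a group could legitimately give different results.
keys = ["ValOption", "RB", "test"]

def func(row):
    # Stand-in for the expensive row-wise function.
    return row["RB"] * 3 + row["test"] - 1

# True for the first row of each group, False for the repeats.
first = df.groupby(keys).cumcount().eq(0)

# Run the slow apply on the first rows only; the other rows stay NaN.
df.loc[first, "new_col"] = df[first].apply(func, axis=1)

# Broadcast each group's first non-null value to the whole group.
df["new_col"] = df.groupby(keys)["new_col"].transform("first").astype(int)

print(df)
```

Note that this relies on "first" skipping missing values, so it would not suit a function that can itself return NaN.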

Alternative (useful for non-vectorized functions):

def func(s):
    s = s.copy()
    s.iloc[0] *= 3
    return s

df['RB_new'] = (df.groupby(['ValOption', 'RB'],
                           as_index=False, sort=False)['RB']
                  .transform(func)
                )
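
As a variation on the transform approach, the function passed to transform can evaluate the computation once on the group's first value and return that result for every row of the group. This is only a sketch: slow_scalar_func and first_then_broadcast are hypothetical names, and it assumes the computation depends on a single column.

```python
import pandas as pd

df = pd.DataFrame({
    "ValOption": ["SLA", "AC", "SLA", "AC", "SLA", "AC", "SLA"],
    "RB": [4, 5, 5, 2, 5, 5, 4],
})

def slow_scalar_func(value):
    # Hypothetical stand-in for an expensive, non-vectorized computation.
    return value * 3

def first_then_broadcast(s):
    # Evaluate once on the group's first value, then repeat it for every row.
    result = slow_scalar_func(s.iloc[0])
    return pd.Series(result, index=s.index)

df["RB_new"] = (
    df.groupby(["ValOption", "RB"], sort=False)["RB"]
      .transform(first_then_broadcast)
)

print(df)
```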

Answer 2

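Here is a way to drop the duplicates from the DataFrame, perform the computation, and then restore the duplicates to the original shape.

To do this, it saves the index of the first row of each group, then calls drop_duplicates and, after the computation, reindex: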

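keys = ['ValOption', 'RB', 'test']

# save indexer: for every row, the index label of the first row of its group
idx = df.groupby(keys)['ValOption'].transform(lambda s: s.index[0])

# drop duplicates on the key columns only (other columns such as 'contrat' may differ)
df2 = df.drop_duplicates(subset=keys).copy()

# perform computation (to replace with the actual function, eventually with apply)
df2['newcol'] = df2['RB']*3+df2['test']-1

# reindex to original shape and attach the result to the original frame
df['newcol'] = df2['newcol'].reindex(idx).to_numpy()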
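
The same deduplicate, compute, restore idea can also be written with a merge on the key columns instead of a saved indexer. The following is a minimal sketch, assuming the question's sample data and that func reads only the key columns; keys, unique, and result are illustrative names.

```python
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "ValOption": ["SLA", "AC", "SLA", "AC", "SLA", "AC", "SLA"],
    "RB": [4, 5, 5, 2, 5, 5, 4],
    "test": [3, 4, 5, 4, 5, 4, 3],
    "contrat": [23, 12, 23, 39, 26, 52, 64],
})

# Assumption: func depends only on these columns.
keys = ["ValOption", "RB", "test"]

def func(row):
    # Stand-in for the expensive row-wise function.
    return row["RB"] * 3 + row["test"] - 1

# One row per distinct key combination (4 rows here instead of 7).
unique = df[keys].drop_duplicates().copy()

# The slow apply runs on the deduplicated rows only.
unique["new_col"] = unique.apply(func, axis=1)

# A left merge on the keys copies each result back to every matching row
# and keeps the original row order.
result = df.merge(unique, on=keys, how="left")

print(result)
```

The time saved depends on how many rows share a key combination: the fewer distinct combinations there are relative to the total row count, the fewer calls to func are needed.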
