Pandas: How to apply a function with multiple column inputs and a where condition

Question

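I have a pandas DataFrame. I want to generate a new variable (column) from multiple column inputs, but only for rows where the year index is greater than or equal to a given value.

The code below illustrates what I want to do, but I would like to simplify it into a function, because the real calculation is more complex than this illustration and the variable names are longer.

Ideally, the function would break the calculation into intermediate temporary values (not saved to df) and span multiple lines so that it is easier to read. For example, one could define: Share = (df['B']+df['C']) / (df['B']+df['C']+df['D']) and then X = A + Share * E.

I have used apply to apply a function to a DataFrame before, but that example used only a single variable as input and had no where clause, and I don't know how to extend it.

How can I simply generate X from A, B, C, D and E, where year >= 2020?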

import numpy as np
import pandas as pd

np.random.seed(2981)

df = pd.DataFrame({
    'year' : [2018, 2019, 2020, 2021,2018, 2019, 2020, 2021,2018, 2019, 2020, 2021],
    'id'   : ['ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','GHI','GHI','GHI','GHI'],
    'A': np.random.choice(range(100),12),
    'B': np.random.choice(range(100),12),
    'C': np.random.choice(range(100),12),
    'D': np.random.choice(range(100),12),
    'E': np.random.choice(range(100),12),
})
df = df.set_index('year')

df['X'] = np.where( df.index >= 2020,  df['A'] + (df['B']+df['C']) / (df['B']+df['C']+df['D']) * df['E'] , np.nan )

Answer 1

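First, you should only use apply when necessary. Vectorized functions are much faster, and the way you have written it now with np.where already takes advantage of them. If you really want to make your code more readable (at a probably small cost in time and memory), you can create an intermediate column and then use it together with the where condition.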

df["Share"] = ( df.B + df.C ) / ( df.B + df.C + df.D )
df["X"] = ( df.A + df.Share * df.E ).where( df.index >= 2020 )
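If the intermediate values should not be stored in df at all, as the question asks, one option is to wrap the same vectorized steps in a function and keep the temporaries as local variables. The following is only a sketch: the function name compute_x, the start_year parameter, and the small demo frame are made up for illustration and do not come from the question's data.

```python
import pandas as pd


def compute_x(frame: pd.DataFrame, start_year: int = 2020) -> pd.Series:
    """Return A + Share * E where the index is >= start_year, NaN elsewhere."""
    # Temporaries live only inside the function; nothing is added to `frame`.
    numerator = frame["B"] + frame["C"]
    share = numerator / (numerator + frame["D"])
    result = frame["A"] + share * frame["E"]
    # Series.where keeps values where the condition is True and puts NaN elsewhere.
    return result.where(frame.index >= start_year)


# Small hand-checkable frame (hypothetical values, indexed by year).
demo = pd.DataFrame(
    {"A": [1, 2, 3], "B": [1, 1, 1], "C": [1, 1, 1], "D": [2, 2, 2], "E": [4, 4, 4]},
    index=pd.Index([2019, 2020, 2021], name="year"),
)
demo["X"] = compute_x(demo)
print(demo)
```

Here Share is 0.5 on every row, so X is NaN for 2019, 4.0 for 2020 and 5.0 for 2021, and the only column added to the frame is X.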

However, to answer your question, you can create a custom function and then apply it to your DataFrame.

def my_func( year,a,b,c,d,e ):
    #This function can be longer and do more things
    return np.nan if year < 2020 else a + ( ( (b + c) / (b + c + d) ) * e )


df['X'] = df.apply( lambda x: my_func( x.name, x.A, x.B, x.C, x.D, x.E ), axis = 1 )

Note that when using apply with axis = 1, you access a row's index label through the row's name attribute.
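A minimal illustration of that point, using a tiny made-up frame rather than the question's data: each row is passed to the function as a Series, and its name holds the index label. If year were an ordinary column instead of the index, it would be read like any other field.

```python
import pandas as pd

demo = pd.DataFrame({"A": [10, 20]}, index=pd.Index([2019, 2020], name="year"))

# With axis=1, each row arrives as a Series whose .name is that row's index label.
labels = demo.apply(lambda row: row.name, axis=1)
print(labels.tolist())  # [2019, 2020]

# After reset_index, year is a regular column and is accessed by key instead.
via_column = demo.reset_index().apply(lambda row: row["year"], axis=1)
print(via_column.tolist())  # [2019, 2020]
```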

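Also, since apply is relatively slow, it may be worth creating columns for some of the intermediate steps (for example, summing several columns) up front, so that this work does not have to be repeated on every iteration.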
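As a rough sketch of that idea, the ratio can be computed once in vectorized form and only the remaining step left to apply. The helper name combine and the demo values are hypothetical; assign is used here so that the intermediate Share column exists only on a temporary copy.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {"A": [1, 2, 3], "B": [1, 1, 1], "C": [1, 1, 1], "D": [2, 2, 2], "E": [4, 4, 4]},
    index=pd.Index([2019, 2020, 2021], name="year"),
)

# Vectorized intermediate step, computed once for the whole frame.
share = (demo["B"] + demo["C"]) / (demo["B"] + demo["C"] + demo["D"])


def combine(year, a, share_value, e):
    # Hypothetical row-wise helper: only the final combination is done per row.
    return np.nan if year < 2020 else a + share_value * e


# assign returns a new DataFrame, so `demo` itself never gains a Share column.
inputs = demo.assign(Share=share)
demo["X"] = inputs.apply(
    lambda row: combine(row.name, row.A, row.Share, row.E), axis=1
)
print(demo)
```

Whether this is noticeably faster than doing everything inside apply depends on the size of the frame and the complexity of the per-row function; the fully vectorized form remains the faster choice when the calculation allows it.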

See this answer for more examples of applying custom functions.

Tags: python, apply, pandas