How to eliminate the for loop in a Pandas DataFrame when filling each row of a column based on multiple if/elif statements

Question

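I am trying to get rid of a for loop in order to speed up filling the values of column 'C' based on if/elif conditions that involve multiple columns and rows. I have not been able to find a suitable solution.

I tried np.where with conditions, choices, and a default value, but I could not get the expected result because I cannot extract individual values from a pandas Series object.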

import pandas as pd

df = pd.DataFrame()
df['A']=['Yes','Yes','No','No','Yes','No','Yes','Yes','Yes','Yes']
df['B']=[1,1,0,1,1,0,1,0,0,1]
df['C']=None
df['D']=['xyz','Yes','No','xyz','Yes','No','xyz','Yes','Yes','Yes']
df.loc[0, 'C'] = 'xyz'  # seed the first row; .loc avoids chained assignment
for i in range(0,len(df)-1):
    if (df.iloc[1+i, 1]==1) & (df.iloc[i, 2]=="xyz") & (df.iloc[1+i, 0]=="No"):
        df.iloc[1+i, 2] = "Minus"
    elif (df.iloc[1+i, 1]==1) & (df.iloc[i, 2]=="xyz") & (df.iloc[1+i, 0]=="Yes"):
        df.iloc[1+i, 2] = "Plus"
    elif (df.iloc[i, 3]!="xyz") or ((df.iloc[1+i, 1]==0) & (df.iloc[i, 2]=="xyz")):
        df.iloc[1+i, 2] = "xyz"
    elif (df.iloc[1+i, 0]=="Yes") & (df.iloc[i, 2]=="xyz"):
        df.iloc[1+i, 2] = "Plus"
    elif (df.iloc[1+i, 0]=="No") & (df.iloc[i, 2]=="xyz"):
        df.iloc[1+i, 2] = "Minus"
    else:
        df.iloc[1+i, 2] = df.iloc[i, 2]
df

I would appreciate the community's help in rewriting the code above so that it runs faster, preferably using NumPy vectorization.

Answer 1

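You certainly cannot vectorize this loop efficiently with NumPy or Pandas, because of the loop-carried data dependency on df['C']: each row of 'C' is computed from the value written to the previous row. The loop is very slow because of the direct Pandas indexing and the string comparisons. Fortunately, you can use Numba to solve this efficiently. You first need to convert the columns to strongly-typed NumPy arrays for Numba to be useful. Note that Numba handles strings quite slowly, so it is better to perform the string checks up front as vectorized NumPy operations.

Here is the resulting code: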
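To make the dependency concrete, here is a minimal sketch (not part of the original answer) that runs the question's rules twice on the question's data: once sequentially, and once as a one-shot np.select that can only look at the initial column 'C', where just row 0 is known. The helper names fill_sequential and fill_one_shot are made up for this illustration, and the empty string stands in for "not filled yet".

```python
import numpy as np

# Data from the question, held as plain NumPy arrays.
A = np.array(['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes'])
B = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1])
D = np.array(['xyz', 'Yes', 'No', 'xyz', 'Yes', 'No', 'xyz', 'Yes', 'Yes', 'Yes'])

def fill_sequential(A, B, D):
    """Reference loop: row i+1 reads the value just written to row i."""
    c = ['xyz']
    for i in range(len(A) - 1):
        prev_is_xyz = c[i] == 'xyz'
        if B[i + 1] == 1 and prev_is_xyz and A[i + 1] == 'No':
            c.append('Minus')
        elif B[i + 1] == 1 and prev_is_xyz and A[i + 1] == 'Yes':
            c.append('Plus')
        elif D[i] != 'xyz' or (B[i + 1] == 0 and prev_is_xyz):
            c.append('xyz')
        elif A[i + 1] == 'Yes' and prev_is_xyz:
            c.append('Plus')
        elif A[i + 1] == 'No' and prev_is_xyz:
            c.append('Minus')
        else:
            c.append(c[i])
    return c

def fill_one_shot(A, B, D):
    """Naive attempt: evaluate all rows at once against the initial column C."""
    c_init = np.array(['xyz'] + [''] * (len(A) - 1))  # only row 0 is known up front
    prev_is_xyz = c_init[:-1] == 'xyz'                # True for the first transition only
    next_a, next_b = A[1:], B[1:]
    conditions = [
        (next_b == 1) & prev_is_xyz & (next_a == 'No'),
        (next_b == 1) & prev_is_xyz & (next_a == 'Yes'),
        (D[:-1] != 'xyz') | ((next_b == 0) & prev_is_xyz),
        (next_a == 'Yes') & prev_is_xyz,
        (next_a == 'No') & prev_is_xyz,
    ]
    choices = ['Minus', 'Plus', 'xyz', 'Plus', 'Minus']
    rest = np.select(conditions, choices, default=c_init[:-1])
    return ['xyz'] + rest.tolist()

c_loop = fill_sequential(A, B, D)
c_naive = fill_one_shot(A, B, D)
print(c_loop)
print(c_naive)
print(c_loop == c_naive)
```

On this data the two results diverge starting at row 3: the one-shot version never sees that row 2 became 'xyz', which is exactly the information the sequential loop carries from one iteration to the next.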

import numpy as np
import numba as nb

@nb.njit('UnicodeCharSeq(8)[:](bool_[:], int64[:], bool_[:])')
def compute(a, b, d):
    n = a.size
    c = np.empty(n, dtype='U8')
    c[0] = 'xyz'
    for i in range(0, n-1):
        prev_is_xyz = c[i] == 'xyz'
        if b[i+1]==1 and prev_is_xyz and not a[i+1]:
            c[i+1] = 'Minus'
        elif b[i+1]==1 and prev_is_xyz and a[i+1]:
            c[i+1] = 'Plus'
        elif d[i] or (b[i+1]==0 and prev_is_xyz):
            c[i+1] = 'xyz'
        elif a[i+1] and prev_is_xyz:
            c[i+1] = 'Plus'
        elif not a[i+1] and prev_is_xyz:
            c[i+1] = 'Minus'
        else:
            c[i+1] = c[i]
    return c

# Convert the dataframe columns to fast Numpy arrays and precompute the string checks
a = df['A'].values.astype('U8') == 'Yes'
b = df['B'].values.astype(np.int64)
d = df['D'].values.astype('U8') != 'xyz'

# Compute the result very quickly with Numba
c = compute(a, b, d)

# Store the result back
df['C'].values[:] = c.astype(object)

Here are the final timings on my machine:

Basic Pandas loops:    2510 us
This Numba code:         20 us

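So the Numba implementation is about 125 times faster. In fact, most of the time is spent in the NumPy conversion code rather than in compute. The gap should be even larger on big dataframes.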
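One way to check where the time goes on your own machine is to time the conversion step in isolation. The snippet below is only a sketch: the convert helper is a name introduced here, to_numpy() is used in place of .values, and the numbers it prints will differ from the timings quoted above.

```python
import timeit
import numpy as np
import pandas as pd

# The question's input columns (column C is not needed for the conversion).
df = pd.DataFrame({
    'A': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes'],
    'B': [1, 1, 0, 1, 1, 0, 1, 0, 0, 1],
    'D': ['xyz', 'Yes', 'No', 'xyz', 'Yes', 'No', 'xyz', 'Yes', 'Yes', 'Yes'],
})

def convert(df):
    """Turn the columns into typed arrays and precompute the string checks."""
    a = df['A'].to_numpy().astype('U8') == 'Yes'
    b = df['B'].to_numpy().astype(np.int64)
    d = df['D'].to_numpy().astype('U8') != 'xyz'
    return a, b, d

runs = 1000
seconds = timeit.timeit(lambda: convert(df), number=runs)
conversion_us = seconds / runs * 1e6  # average microseconds per call
print(f"conversion: {conversion_us:.1f} us per call")
```

Timing compute(a, b, d) the same way, after a first warm-up call so that compilation is excluded, gives the other half of the comparison.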

Note that the line df['C'].values[:] = c.astype(object) is much faster than the equivalent expression df['C'] = c (about 16 times).
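Since Numba is slow with strings, a possible variation (an assumption of this sketch, not something measured in the answer) is to keep strings out of the loop entirely: store small integer codes for the three labels and map them back to strings once at the end. The names XYZ, PLUS, MINUS, LABELS and compute_codes are hypothetical, and the try/except only lets the sketch run as plain Python when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit      # optional: compile the loop if Numba is available
except ImportError:
    def njit(func):             # fallback: run the same loop as plain Python
        return func

# Hypothetical integer codes standing in for the three labels of column C.
XYZ, PLUS, MINUS = 0, 1, 2
LABELS = np.array(['xyz', 'Plus', 'Minus'], dtype=object)

@njit
def compute_codes(a, b, d):
    """Same branch structure as compute, but on integer codes instead of strings."""
    n = a.size
    c = np.empty(n, dtype=np.int8)
    c[0] = XYZ
    for i in range(n - 1):
        prev_is_xyz = c[i] == XYZ
        if b[i + 1] == 1 and prev_is_xyz and not a[i + 1]:
            c[i + 1] = MINUS
        elif b[i + 1] == 1 and prev_is_xyz and a[i + 1]:
            c[i + 1] = PLUS
        elif d[i] or (b[i + 1] == 0 and prev_is_xyz):
            c[i + 1] = XYZ
        elif a[i + 1] and prev_is_xyz:
            c[i + 1] = PLUS
        elif not a[i + 1] and prev_is_xyz:
            c[i + 1] = MINUS
        else:
            c[i + 1] = c[i]
    return c

# The question's data, already converted to the boolean/integer inputs.
a = np.array(['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes']) == 'Yes'
b = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1], dtype=np.int64)
d = np.array(['xyz', 'Yes', 'No', 'xyz', 'Yes', 'No', 'xyz', 'Yes', 'Yes', 'Yes']) != 'xyz'

codes = compute_codes(a, b, d)
c = LABELS[codes]               # decode the integer codes back to labels in one step
print(c)
```

Whether this beats the string-based version depends on the data size and the Numba version, so it is worth timing both before adopting either.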
