Pandas: create a new column based on groupby and apply a lambda if statement

Question

I have a question about groupby and apply.

df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'b'], 'B': np.r_[1:8]})

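For each group in A, I want to create a column C that is 1 if the absolute z-score of B within the group is greater than 2, and 0 otherwise. My code: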

from scipy import stats
df['C'] = df.groupby('A').apply(lambda x: 1 if np.abs(stats.zscore(x['B'], nan_policy='omit')) > 2 else 0, axis=1)

However, the code does not work and I cannot figure out what the problem is.

Answer 1

Use GroupBy.transform with a lambda function, then compare the result with 2 and convert True/False to 1/0 by casting to integer:

from scipy import stats

s = df.groupby('A')['B'].transform(lambda x: np.abs(stats.zscore(x, nan_policy='omit')))
df['C'] = (s > 2).astype(int)

Or use numpy.where:

df['C'] = np.where(s > 2, 1, 0)
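
With the sample data from the question, no absolute z-score exceeds 2, so C would be all zeros. The following is a minimal, self-contained sketch of the same transform approach on made-up data that does contain an outlier. The frame `demo` and the helper `abs_zscore` are hypothetical names introduced here, and the z-score is computed with plain pandas (population standard deviation, ddof=0) as a stand-in for stats.zscore:

```python
import pandas as pd

def abs_zscore(values: pd.Series) -> pd.Series:
    # Hypothetical helper: absolute population z-score (ddof=0).
    # pandas skips NaN in mean() and std() by default.
    return ((values - values.mean()) / values.std(ddof=0)).abs()

# Made-up data: group 'a' has one clear outlier (100), group 'b' has none.
demo = pd.DataFrame({
    "A": ["a"] * 6 + ["b"] * 4,
    "B": [1, 1, 1, 1, 1, 100, 4, 5, 6, 7],
})

# transform returns a Series aligned with demo's index, one value per row.
s = demo.groupby("A")["B"].transform(abs_zscore)
demo["C"] = (s > 2).astype(int)
print(demo)
```

Only the row with B == 100 should be flagged. Note that with ddof=0 a group needs at least six values before any single point can reach an absolute z-score above 2, which is why the question's groups of three and four rows can never produce a 1.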

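The error in your solution is raised for each group, because the lambda passes the group's whole array of z-scores to if: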

from scipy import stats

df = df.groupby('A')['B'].apply(lambda x: 1 if np.abs(stats.zscore(x, nan_policy='omit')) > 2 else 0)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The gotchas section of the pandas docs explains:

pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if-statement or when using the boolean operations: and, or, and not.
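
The behaviour described in the quote can be reproduced without pandas or groupby. This is a small sketch using a made-up array `z` (an assumption for illustration, not data from the question): the if-else raises because the comparison yields one boolean per element, while any(), all(), or an elementwise cast each give a well-defined result.

```python
import numpy as np

# Made-up z-scores for illustration.
z = np.array([0.5, 2.5, 1.0])

try:
    # z > 2 is an array of booleans, so `if` cannot reduce it to one value.
    flag = 1 if z > 2 else 0
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Explicit reductions give a single boolean.
any_above = bool((z > 2).any())
all_above = bool((z > 2).all())

# An elementwise cast keeps one 0/1 flag per value, which is what column C needs.
elementwise = (z > 2).astype(int)
```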

So if you replace the if-else with the comparison and integer cast from the solutions above:

from scipy import stats

df = df.groupby('A')['B'].apply(lambda x: (np.abs(stats.zscore(x, nan_policy='omit')) > 2).astype(int))

print (df)
A
a       [0, 0, 0]
b    [0, 0, 0, 0]
Name: B, dtype: object

But the result then holds one array per group and still has to be converted into a column. To avoid this problem, use groupby.transform.
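
The difference in result shape can be sketched as follows. The data and the helper `abs_zscore` are hypothetical, the z-score is computed with plain pandas instead of stats.zscore, and the rows are deliberately not sorted by group to show that transform keeps the original row order:

```python
import pandas as pd

def abs_zscore(values: pd.Series) -> pd.Series:
    # Hypothetical helper: absolute population z-score (ddof=0).
    return ((values - values.mean()) / values.std(ddof=0)).abs()

# Made-up data with interleaved groups.
demo = pd.DataFrame({"A": ["b", "a", "b", "a", "b"], "B": [4, 1, 5, 3, 9]})

# apply with an array-returning function: one object per group, indexed by group key.
per_group = demo.groupby("A")["B"].apply(
    lambda x: (abs_zscore(x) > 2).astype(int).to_numpy()
)

# transform: one value per row, indexed like demo, so it can be assigned directly.
aligned = demo.groupby("A")["B"].transform(abs_zscore)
demo["C"] = (aligned > 2).astype(int)
```

Here `per_group` has two entries (one per group), while `aligned` has five entries that share the index of `demo`.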

Answer 2

You can use groupby + apply to compute the z-score of every item in each group, explode the resulting lists, use gt to create a boolean Series, and convert it to dtype int:

df['C'] = df.groupby('A')['B'].apply(lambda x: stats.zscore(x, nan_policy='omit')).explode(ignore_index=True).abs().gt(2).astype(int)

Output: with the sample data, no absolute z-score exceeds 2, so every value in C is 0.
