Pandas groupby transform

Question

I need confirmation about the behavior of pandas groupby transform:

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two'],
                   'C': [1, 5, 5, 2, 5, 5],
                   'D': [2.0, 5., 8., 1., 2., 9.]})
grouped = df.groupby('A')
# Standardize each column within its group (z-score)
grouped.transform(lambda x: (x - x.mean()) / x.std())

          C         D
0 -1.154701 -0.577350
1  0.577350  0.000000
2  0.577350  1.154701
3 -1.154701 -1.000000
4  0.577350 -0.577350
5  0.577350  1.000000

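The call does not specify which columns the lambda function is applied to. How does pandas decide which columns to apply the function to (C and D in this case)? Why doesn't it apply the function to column B and raise an error?

Why does the output not include columns A and B?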

Answer 1

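GroupBy.transform calls the specified function for each column in each group (so B, C, and D, but not A, because that is what you are grouping by). However, the functions you are calling (mean and std) only work on numeric values, so pandas skips a column if its dtype is not numeric. String columns have dtype object, which is not numeric, so B is dropped and you are left with C and D.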
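To see which columns count as numeric before calling transform, you can inspect the dtypes yourself. This is only a minimal sketch using the DataFrame from the question; the variable name numeric_cols is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two'],
                   'C': [1, 5, 5, 2, 5, 5],
                   'D': [2.0, 5., 8., 1., 2., 9.]})

# Show the dtype of every column; A and B hold strings, C and D hold numbers.
print(df.dtypes)

# Keep only the columns with a numeric dtype (illustrative variable name).
numeric_cols = df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)  # ['C', 'D']
```

Here C and D are the only columns that mean and std can work with, which matches the columns in your output.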

When you run your code, you should get this warning:

FutureWarning: Dropping invalid columns in DataFrameGroupBy.transform is deprecated. In a future version, a TypeError will be raised. Before calling .transform, select only columns which should be valid for the transforming function.

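As it says, you need to select the columns you want to process before transforming in order to avoid the warning. You can do that by adding [['C', 'D']] (to select, for example, your C and D columns) before calling transform: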

grouped[['C', 'D']].transform(lambda x: (x - x.mean()) / x.std())
#      ^^^^^^^^^^^^ important
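If you also want A and B next to the transformed values, one option is to join the result back onto the original columns, since transform returns a frame with the same index as the input. This is a sketch rather than the only way to do it; the helper name zscore and the variables cols, result, and out are illustrative, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two'],
                   'C': [1, 5, 5, 2, 5, 5],
                   'D': [2.0, 5., 8., 1., 2., 9.]})


def zscore(x):
    """Standardize a Series: subtract the mean, divide by the sample std."""
    return (x - x.mean()) / x.std()


# Select the numeric columns explicitly so no column is dropped implicitly.
cols = ['C', 'D']
result = df.groupby('A')[cols].transform(zscore)

# result is indexed like df, so it lines up row by row with A and B.
out = df[['A', 'B']].join(result)
print(out)
```

Selecting the columns explicitly also keeps the code working once the deprecation mentioned in the warning is enforced and the unselected call raises a TypeError.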

python

pandas