Fill missing values by group using the most frequent value

Question

I am trying to impute missing values with the most frequent value using the pandas module in Python. After looking at some posts on Stack Overflow, I managed to get this working:

import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["A", "A", "A", "A", "B", "B", "B"],
                   "value": [1, 1, 1, np.nan, 2, np.nan, np.nan]})
df.groupby("group").transform(lambda x: x.fillna(x.mode().iloc[0]))

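Running this code fills the missing entry in group "A" with 1 and the missing entries in group "B" with 2. However, suppose one of the groups contains only missing data (group "B" in this example):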

df1 = pd.DataFrame({"group": ["A", "A", "A", "A", "B", "B", "B"],
                    "value": [1, 1, 1, np.nan, np.nan, np.nan, np.nan]})
df1.groupby("group").transform(lambda x: x.fillna(x.mode().iloc[0]))

Running the code above raises IndexError: single positional indexer is out-of-bounds. I would expect the normal behavior to be keeping np.nan, because if you run the mode method on just, say, group "B" from df1:

df1[df1.group == "B"].mode()

I get NaN as the most frequent value. How can I avoid this problem?

Answer 1

Running the code above will prompt an IndexError: single positional indexer is out-of-bounds

This is because transform passes each column to the function as a Series, so at some point it sees the value column on its own. If you do:

df1[df1.group == "B"].value.mode()

you get

Series([], dtype: float64)

Hence the out-of-bounds error: the result is empty, so iloc[0] does not exist.

OTOH, when you do:

df1[df1.group == "B"].mode()

mode is computed on a DataFrame rather than a Series, and pandas gives NaN for an all-NaN column, which here is the value column.
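To make the contrast concrete, here is a small sketch that computes both results for group "B" side by side. The variable names group_b, series_mode and frame_mode are only illustrative and do not come from the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"group": ["A", "A", "A", "A", "B", "B", "B"],
                    "value": [1, 1, 1, np.nan, np.nan, np.nan, np.nan]})
group_b = df1[df1.group == "B"]  # illustrative name

# Series.mode drops NaN by default, so an all-NaN column has no mode at all
# and the result is an empty Series.
series_mode = group_b["value"].mode()
print(series_mode.empty)

# DataFrame.mode collects the modes of every column into one frame.
# "group" has one mode ("B"), so the "value" column, which has none,
# is padded with NaN to line up with it.
frame_mode = group_b.mode()
print(frame_mode)
```

Because the frame has one row, frame_mode.iloc[0] exists, and its value entry is NaN, so fillna leaves group "B" unchanged.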

So one remedy is to use apply instead of transform, so that a DataFrame rather than an individual Series is passed to your lambda:

df1.groupby("group").apply(lambda x: x.fillna(x.mode().iloc[0])).reset_index(drop=True)

which gives

  group  value
0     A    1.0
1     A    1.0
2     A    1.0
3     A    1.0
4     B    NaN
5     B    NaN
6     B    NaN
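If you would rather stay with transform, another option is to guard against the empty mode explicitly. This is only a sketch: fill_with_mode is a hypothetical helper name, not a pandas function, and it assumes that a group with no non-missing values should be left as NaN. It also touches only the value column, which may be convenient since newer pandas releases have been changing how groupby.apply treats the grouping column.

```python
import numpy as np
import pandas as pd


def fill_with_mode(s: pd.Series) -> pd.Series:
    """Fill NaN in s with its most frequent value (hypothetical helper)."""
    modes = s.mode()
    if modes.empty:
        # No non-missing values in this group: nothing to fill with.
        return s
    # When several values tie, mode() returns them sorted; take the first.
    return s.fillna(modes.iloc[0])


df1 = pd.DataFrame({"group": ["A", "A", "A", "A", "B", "B", "B"],
                    "value": [1, 1, 1, np.nan, np.nan, np.nan, np.nan]})

filled = df1.copy()
filled["value"] = df1.groupby("group")["value"].transform(fill_with_mode)
print(filled)
```

Group "A" is filled with 1.0 and group "B" stays NaN, matching the output of the apply version.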

nan

python-3.x

pandas

imputation

pandas-groupby