Fill NaN values with the mean of the previous rows?

Question

I need to fill the NaN values in one column of a DataFrame with the mean of the previous 3 rows. Here is an example:

import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
df
col1
0   1.0
1   3.0
2   4.0
3   5.0
4   NaN
5   NaN
6   NaN 
7   7.0

This is the output I need:

I tried rolling, but it does not work the way I want when the column has more than one NaN value inside a window:

df.fillna(df.rolling(3, min_periods=1).mean().shift())


col1
0   1.0
1   3.0
2   4.0
3   5.0
4   4.0 # np.nanmean([3, 4, 5])
5   4.5 # np.nanmean([np.nan, 4, 5])
6   5.0 # np.nanmean([np.nan, np.nan, 5])
7   7.0

Can anyone help me? Thanks in advance!

Answer 1

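I tried two approaches to this problem. One loops over the DataFrame; the other essentially applies the method you suggested several times until it converges on the correct answer.

Loop approach

For each row in the DataFrame, get the value from col1. Then take the mean of the values from the last few rows. (Near the start of the DataFrame, this list may hold fewer than 3 values.) If the value is NaN, replace it with that mean. Then save the value back into the DataFrame. If the list of recent values holds more than 3 values, drop the oldest one.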

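def impute(df2, col_name):
    # Holds the (already filled) values of the previous rows, at most 3
    last_3 = []
    for index in df2.index:
        val = df2.loc[index, col_name]
        # Mean of the previous rows; NaN when there is nothing to average yet
        if len(last_3) > 0:
            imputed = np.nanmean(last_3)
        else:
            imputed = np.nan
        # Replace a missing value with the mean of the previous rows
        if np.isnan(val):
            val = imputed
        last_3.append(val)
        df2.loc[index, col_name] = val
        # Keep only the 3 most recent values by dropping the oldest
        if len(last_3) > 3:
            last_3.pop(0)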
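
As a quick check of the loop idea against the sample data from the question, here is a minimal self-contained sketch. It is not the code above verbatim: the name impute_loop, the window parameter, and the use of collections.deque (which drops the oldest value automatically) are assumptions made for illustration, and it returns a copy instead of modifying the DataFrame in place.

```python
from collections import deque

import numpy as np
import pandas as pd


def impute_loop(df, col_name, window=3):
    """Fill NaNs with the mean of the previous `window` values (hypothetical helper)."""
    recent = deque(maxlen=window)  # oldest value is dropped automatically
    filled = []
    for val in df[col_name].to_numpy(dtype=float):
        # Fill a missing value only if there is something to average
        if np.isnan(val) and recent:
            val = sum(recent) / len(recent)
        # Values that are still NaN (leading NaNs) never enter the window
        if not np.isnan(val):
            recent.append(val)
        filled.append(val)
    out = df.copy()
    out[col_name] = filled
    return out


df = pd.DataFrame({'col1': [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
result = impute_loop(df, 'col1')
print(result)
```

Rows 4 to 6 come out as 4.0, then the mean of 4, 5 and 4.0, then the mean of 5, 4.0 and the previous fill, because each filled value is fed back into the window.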

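Repeated column operation

The core idea here is to notice that, in your rolling example, the first NaN replacement value is correct. So you apply the rolling mean, take only the first NaN of each run of NaN values, and use that number. If you apply this repeatedly, you fill in the first missing value, then the second, then the third. You need to run this loop as many times as the length of the longest run of consecutive NaN values.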

def impute(df2, col_name):
    # Each pass fills the first NaN of every run, so repeat until none is left.
    # Assumes the column does not start with NaN; such a value can never be filled.
    while df2[col_name].isna().any():
        is_na = df2[col_name].isna()
        # If there are multiple NaN values in a row, identify just
        # the first one
        first_na = is_na & ~is_na.shift(fill_value=False)
        # Compute mean of previous 3 values
        imputed = df2[col_name].rolling(3, min_periods=1).mean().shift()
        # Replace NaN values with the mean if they are the very first NaN
        # value in a run of NaN values
        df2.loc[first_na, col_name] = imputed[first_na]
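
To see how many passes the repeated approach needs on the sample data, here is a small self-contained sketch. The name impute_repeated, the pass counter, and the early exit for values that can never be filled (for example a column that starts with NaN) are assumptions added for illustration; they are not part of the answer's code.

```python
import numpy as np
import pandas as pd


def impute_repeated(df, col_name, window=3):
    """Fill the first NaN of every run per pass (hypothetical helper)."""
    out = df.copy()
    passes = 0
    while out[col_name].isna().any():
        is_na = out[col_name].isna()
        # First NaN of each run: NaN here, but not NaN in the row before
        prev_na = is_na.shift(fill_value=False).astype(bool)
        first_na = is_na & ~prev_na
        # Mean of the previous `window` rows, ignoring NaNs inside the window
        imputed = out[col_name].rolling(window, min_periods=1).mean().shift()
        fillable = first_na & imputed.notna()
        if not fillable.any():
            break  # nothing left that can be filled, e.g. leading NaNs
        out.loc[fillable, col_name] = imputed[fillable]
        passes += 1
    return out, passes


df = pd.DataFrame({'col1': [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
result, passes = impute_repeated(df, 'col1')
print(result)
print("passes:", passes)
```

On this input the longest run of NaN values has length 3, so the loop body runs three times.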

Performance comparison

Running both on an 80,000-row DataFrame, I get the following results:

Loop approach takes 20.744 seconds
Repeated column operation takes 0.056 seconds
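
For anyone who wants to reproduce a comparison like this, here is a rough timing sketch. Everything in it is an assumption made for illustration: the synthetic data (random values with roughly 10% NaN and a non-NaN first row), the small n_rows so it finishes quickly, and the helper names. Absolute timings depend on the machine and the pandas version, so no expected numbers are shown.

```python
import time

import numpy as np
import pandas as pd


def impute_loop(df, col_name):
    # Row-by-row version using .loc, as in the loop approach
    out = df.copy()
    last_3 = []
    for index in out.index:
        val = out.loc[index, col_name]
        if np.isnan(val) and last_3:
            val = np.nanmean(last_3)
            out.loc[index, col_name] = val
        last_3.append(val)
        if len(last_3) > 3:
            last_3.pop(0)
    return out


def impute_repeated(df, col_name):
    # Vectorised version; assumes the column does not start with NaN
    out = df.copy()
    while out[col_name].isna().any():
        is_na = out[col_name].isna()
        first_na = is_na & ~is_na.shift(fill_value=False).astype(bool)
        imputed = out[col_name].rolling(3, min_periods=1).mean().shift()
        out.loc[first_na, col_name] = imputed[first_na]
    return out


def time_call(func, *args):
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


n_rows = 2000  # the answer used 80000; kept small here so the sketch runs fast
rng = np.random.default_rng(0)
values = rng.normal(size=n_rows)
mask = rng.random(n_rows) < 0.1
mask[0] = False  # keep the first row non-NaN
values[mask] = np.nan
df = pd.DataFrame({'col1': values})

loop_result, loop_seconds = time_call(impute_loop, df, 'col1')
rep_result, rep_seconds = time_call(impute_repeated, df, 'col1')
print(f"Loop approach takes {loop_seconds:.3f} seconds")
print(f"Repeated column operation takes {rep_seconds:.3f} seconds")
```

This sketch compares run time only. The two functions need not return identical values when runs of NaN are separated by fewer than three rows, because with min_periods=1 a window can average over fewer values while an earlier run is still partly unfilled.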

Answer 2

Possibly not the most efficient, but it is concise and gets the job done:

from functools import reduce
reduce(lambda d, _: d.fillna(d.rolling(3, min_periods=3).mean().shift()), range(df['col1'].isna().sum()), df)

Output


    col1
0   1.000000
1   3.000000
2   4.000000
3   5.000000
4   4.000000
5   4.333333
6   4.444444
7   7.000000

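We basically use fillna, but requiring min_periods=3 means it fills only one NaN at a time, or more precisely only those NaNs that have three non-NaN numbers immediately before them. We then use reduce to repeat this operation as many times as there are NaNs in col1.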
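
To make the pass-by-pass behaviour visible, the reduce call can be unrolled into an explicit loop. This is only an illustrative sketch: the names filled and history, and the early break once nothing is missing, are assumptions and not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})

filled = df.copy()
history = []  # column values after each pass
for _ in range(int(df['col1'].isna().sum())):
    # With min_periods=3 the rolling mean is NaN unless all three
    # previous rows are present, so only such NaNs get filled per pass
    filled = filled.fillna(filled.rolling(3, min_periods=3).mean().shift())
    history.append(filled['col1'].tolist())
    if not filled['col1'].isna().any():
        break  # stop early once nothing is missing

for pass_number, values in enumerate(history, start=1):
    print(pass_number, values)
```

On the sample data the first pass fills row 4 only, the second fills row 5, and the third fills row 6, which is why repeating the step once per NaN is sufficient here.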
