Python Pandas imputation of Null values

Question

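I am trying to impute Null values with an offset that corresponds to the average of the row, df[row,'avg'], and the average of the column ('impute[col]'). Is there a way to do this that would let the method parallelize with .map? Or is there a better way to iterate over the indexes that contain Null values?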

import numpy as np
import pandas as pd

test = pd.DataFrame({'a': [None, 2, 3, 1], 'b': [2, np.nan, 4, 2],
                     'c': [3, 4, np.nan, 3], 'avg': [2.5, 3, 3.5, 2]})
test = test[['a', 'b', 'c', 'avg']]
impute = {'a': 2, 'b': 3.33, 'c': 6}

def smarterImpute(df, impute):
    df2 = df.copy()  # work on a copy so the caller's frame is not modified
    for col in df.columns[:-1]:  # every column except the trailing 'avg'
        for row in df.index:
            if pd.isnull(df.loc[row, col]):
                df2.loc[row, col] = (impute[col]
                                     + (df.loc[:, 'avg'].mean() - df.loc[row, 'avg']))
    print(df2)
    return df2

smarterImpute(test, impute)

Answer 1

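Notice that in your 'filling' expression:

impute[col] + (df.loc[:,'avg'].mean() - df.loc[row,'avg'])

the first term depends only on the column and the third only on the row; the second is just a constant. So we can build an imputation DataFrame to look up whichever value needs to be filled: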

impute_df = pd.DataFrame(impute, index = test.index).add(test.avg.mean() - test.avg, axis = 0)
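To make the lookup table concrete, here is a minimal sketch that rebuilds it from the question's sample data. The intermediate name offset is introduced here only for illustration; everything else comes from the question and the line above.

```python
import numpy as np
import pandas as pd

# Sample data from the question
test = pd.DataFrame({'a': [np.nan, 2, 3, 1],
                     'b': [2, np.nan, 4, 2],
                     'c': [3, 4, np.nan, 3],
                     'avg': [2.5, 3, 3.5, 2]})
impute = {'a': 2, 'b': 3.33, 'c': 6}

# Row-wise part: overall mean of 'avg' minus each row's own 'avg'
# ('offset' is a name assumed for this sketch)
offset = test['avg'].mean() - test['avg']
print(offset.tolist())  # [0.25, -0.25, -0.75, 0.75]

# Column-wise part: each column's base value repeated down the index,
# then shifted row by row with the offset
impute_df = pd.DataFrame(impute, index=test.index).add(offset, axis=0)
print(impute_df)
```

Every cell of impute_df holds the value that cell would receive if it were missing, so the per-cell loop is no longer needed to compute it.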

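Then there is a method called .combine_first() that lets you fill the NAs in one DataFrame with the values from another DataFrame, which is exactly what we need. We use it, and we're done: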

test.combine_first(impute_df)
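A self-contained sketch of that final step, using the question's sample data, might look like the following. The name filled is an assumption for this example. Because combine_first returns a new frame over the union of both frames' columns, the sketch reselects test.columns at the end to be sure the original column order is kept.

```python
import numpy as np
import pandas as pd

# Sample data from the question
test = pd.DataFrame({'a': [np.nan, 2, 3, 1],
                     'b': [2, np.nan, 4, 2],
                     'c': [3, 4, np.nan, 3],
                     'avg': [2.5, 3, 3.5, 2]})
impute = {'a': 2, 'b': 3.33, 'c': 6}

# Lookup table of replacement values, one per cell
impute_df = pd.DataFrame(impute, index=test.index).add(
    test['avg'].mean() - test['avg'], axis=0)

# Keep existing values from `test`; take values from `impute_df` only where
# `test` is missing. `test` itself is left unchanged.
filled = test.combine_first(impute_df)

# Restore the original column order ('filled' is a name assumed for this sketch)
filled = filled[test.columns]
print(filled)
```

Only the three missing cells change; every value that was already present in test is carried over as is.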

With pandas, you generally want to avoid loops and look for a vectorized solution instead.
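As a rough check that the loop and the vectorized form agree, the sketch below runs both on the question's sample data and compares the results. The function names impute_loop and impute_vectorized are hypothetical, and the vectorized version uses DataFrame.fillna with a DataFrame argument rather than combine_first, purely as an alternative that keeps the original column order.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'a': [np.nan, 2, 3, 1],
                     'b': [2, np.nan, 4, 2],
                     'c': [3, 4, np.nan, 3],
                     'avg': [2.5, 3, 3.5, 2]})
impute = {'a': 2, 'b': 3.33, 'c': 6}

def impute_loop(df, impute):
    """Cell-by-cell version, mirroring the question's approach."""
    out = df.copy()
    avg_mean = df['avg'].mean()
    for col in impute:
        for row in df.index:
            if pd.isna(df.loc[row, col]):
                out.loc[row, col] = impute[col] + (avg_mean - df.loc[row, 'avg'])
    return out

def impute_vectorized(df, impute):
    """Builds the whole table of fill values at once, then fills the gaps."""
    fill = pd.DataFrame(impute, index=df.index).add(
        df['avg'].mean() - df['avg'], axis=0)
    # fillna with a DataFrame matches on both index and column labels
    return df.fillna(fill)

loop_result = impute_loop(test, impute)
vec_result = impute_vectorized(test, impute)

# Raises AssertionError if the two results differ
pd.testing.assert_frame_equal(loop_result, vec_result)
```

On a four-row frame the difference in speed is negligible; the vectorized form is expected to matter as the number of rows grows, since it replaces per-cell .loc lookups with whole-column operations.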
