fillna() used in a for loop doesn't affect the dataframe

Question

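I have a dataframe, which I called train, containing data from the Titanic disaster. Among its columns are 'Pclass', which indicates the passenger's class (there are three classes: 1, 2 and 3), and 'Age'. Not all ages are known, and I want to fill the NaN values in 'Age' with the mean, but using a different mean for each class.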

The code is as follows:

for i in np.arange(1,4):
     obj=train[train['Pclass']==i]['Age'].mean()
     train[train['Pclass']==i]['Age'].fillna(value=obj,inplace=True)

But when I display the dataframe afterwards, the NaN values are still there. Can someone explain why?

Answer 1

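On my system this code raises a SettingWithCopyWarning, which links to the "caveats" section of the pandas documentation. The details are complicated, and that section explains them in several parts that are worth reading.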

The recommended way to index is to use loc with a mask and a single fixed column label, as in:

nans = train['Age'].isna() # find all the relevant nans, once
for i in range(1,4):
    mask = train['Pclass'] == i
    # incorporate known nan locations to perform a single __setitem__ on a loc
    train.loc[mask & nans, 'Age'] = train.loc[mask, 'Age'].mean()

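This works because it is a single __setitem__ call (foo[item] = bar) on train.loc, which is guaranteed to operate on the original DataFrame. In contrast, calling fillna on the result of a __getitem__ call (foo[item].fillna(...)) may mean that fillna operates on a copy of the slice rather than on a view of the original DataFrame (which appears to be the case here). The inplace argument of fillna does what it is supposed to do, but because it is working on a copy rather than the original, you never get access to the result.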
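To make the difference concrete, here is a minimal sketch that runs both approaches side by side. The toy data and the names chained, fixed, nans_left_chained and nans_left_fixed are made up for illustration and are not from the question; the warning filter is only there to keep the output quiet.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real `train` dataframe.
train = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 30.0, np.nan, 20.0, 22.0, np.nan],
})

# Chained indexing: the boolean-mask __getitem__ builds a new object,
# so the in-place fillna changes that temporary, not `chained` itself.
chained = train.copy()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the chained-assignment warning
    for i in range(1, 4):
        mean_age = chained[chained["Pclass"] == i]["Age"].mean()
        chained[chained["Pclass"] == i]["Age"].fillna(value=mean_age, inplace=True)
nans_left_chained = int(chained["Age"].isna().sum())

# Single assignment through loc: writes into the dataframe itself.
fixed = train.copy()
nans = fixed["Age"].isna()
for i in range(1, 4):
    mask = fixed["Pclass"] == i
    fixed.loc[mask & nans, "Age"] = fixed.loc[mask, "Age"].mean()
nans_left_fixed = int(fixed["Age"].isna().sum())

print(nans_left_chained, nans_left_fixed)
```

With this toy data, the chained version should leave all three NaN values in place, while the loc version should fill every one of them with the mean of its class.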

According to the documentation I linked,

Outside of simple cases, it’s very hard to predict whether [__getitem__] will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify [the original DataFrame] or a temporary object that gets thrown out immediately afterward. That’s what SettingWithCopy is warning you about!

As a small bonus, using loc and reusing the mask here is more efficient than the chained indexing you started with.
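If you want to drop the explicit loop altogether, a different technique (not the one shown above) is to compute the per-class means with groupby(...).transform('mean') and pass the resulting Series to fillna. This is only a sketch on made-up toy data; the variable class_means is a name chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real `train` dataframe.
train = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 30.0, np.nan, 20.0, 22.0, np.nan],
})

# For every row, the mean Age of that row's Pclass (NaNs are skipped),
# aligned to the original index.
class_means = train.groupby("Pclass")["Age"].transform("mean")

# fillna with a Series fills each NaN from the matching index label.
# Assigning the result back avoids in-place edits on a temporary.
train["Age"] = train["Age"].fillna(class_means)
```

Because the result is assigned back to train['Age'] in a single step, this form does not rely on whether an intermediate object is a view or a copy.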
