What's wrong with fillna in Pandas running twice?

Question

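I'm new to Pandas and Numpy. I'm trying to solve the Kaggle | Titanic Dataset. Right now I have to fix two columns, "Age" and "Embarked", because they contain NaN.

I tried fillna without success, and soon found out that I was missing inplace = True.

So I added it. The first imputation now succeeds, but the second one does not. I searched SO and Google but did not find anything useful. Please help me.

Here is the code I tried.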

# imputing "Age" with mean
titanic_df["Age"].fillna(titanic_df["Age"].mean(), inplace = True)
    
# imputing "Embarked" with mode
titanic_df["Embarked"].fillna(titanic_df["Embarked"].mode(), inplace = True)

print(titanic_df["Age"][titanic_df["Age"].isnull()].size)
print(titanic_df["Embarked"][titanic_df["Embarked"].isnull()].size)

The output I get is

0
2

However, I managed to get what I wanted without using inplace=True:

titanic_df["Age"] =titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df.fillna(titanic_df["Embarked"].mode())

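But I'm curious about what is going on with the second usage of inplace=True.

Please bear with me if the question is silly; I'm totally new and may be missing something trivial. Any help is appreciated. Thanks in advance.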

Answer 1

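pd.Series.mode returns a Series.

A variable has only one arithmetic mean and one median, but it may have several modes. If more than one value has the highest frequency, there will be multiple modes.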
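As a small illustration of the difference, here is a sketch on made-up data (the `ports` values are hypothetical, not taken from the Titanic set). Two values tie for the highest frequency, so mode returns a Series with two rows, while mean returns a single float.

```python
import pandas as pd

# Hypothetical toy data: "S" and "C" each appear twice, so both are modes.
ports = pd.Series(["S", "C", "S", "C", "Q"])

# mode() always returns a Series, with one row per mode (sorted).
modes = ports.mode()

# mean() on a Series collapses to a single scalar.
single_mean = pd.Series([1.0, 2.0, 3.0]).mean()

print(modes)
print(single_mean)
```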

pandas operates on labels.

titanic_df.mean()
Out: 
PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

If I were to use titanic_df.fillna(titanic_df.mean()), it would return a new DataFrame in which the PassengerId column is filled with 446.0, the Survived column with 0.38, and so on.

However, if I call the mean method on a Series, the return value is a float:

titanic_df['Age'].mean()
Out: 29.69911764705882

There is no label associated with it here. So if I use titanic_df.fillna(titanic_df['Age'].mean()), all missing values in all columns will be filled with 29.699.
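The two behaviours can be compared side by side on a tiny frame. This is only a sketch; the column names `a` and `b` and their values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with one missing value per column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})

# df.mean() is a Series labelled by column name, so each column
# is filled with its own mean (a -> 2.0, b -> 15.0).
by_label = df.fillna(df.mean())

# df["a"].mean() is a bare float with no label, so every NaN in
# every column is filled with the same number (2.0).
by_scalar = df.fillna(df["a"].mean())

print(by_label)
print(by_scalar)
```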

Why the first attempt was not successful

You tried to fill the entire DataFrame, titanic_df, with titanic_df["Embarked"].mode(). Let's check the output first:

titanic_df["Embarked"].mode()
Out: 
0    S
dtype: object

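This is a Series with a single element. The index is 0 and the value is S. Now, recall how it would work if we used titanic_df.mean() to fill: it would fill each column with the corresponding mean. Here, we have only one label, so it would only fill values if we had a column named 0. Try adding df[0] = np.nan and running your code again. You will see that the new column is filled with S.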
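Here is a sketch of that experiment on a made-up four-row frame. The data is hypothetical, and the extra column is created with object dtype (my choice, so that no dtype conversion gets in the way). The last line applies the same mode Series to the Embarked column alone, as in the question's code; my understanding is that the labels are then matched against the row index, so only row 0 could ever be filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing port, in row 1.
df = pd.DataFrame({"Embarked": ["S", np.nan, "C", "S"]})

# Add an all-missing column whose label is the integer 0.
df[0] = pd.Series([np.nan] * 4, dtype=object)

mode = df["Embarked"].mode()  # Series: index 0 -> "S"

# DataFrame.fillna matches the Series label 0 against column names,
# so only the column called 0 is filled.
filled = df.fillna(mode)

# Series.fillna matches label 0 against the row index instead,
# so the NaN in row 1 is left alone.
series_filled = df["Embarked"].fillna(mode)

print(filled)
print(series_filled)
```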

Why the second attempt was (not) successful

The right-hand side of the assignment, titanic_df.fillna(titanic_df["Embarked"].mode()), returns a new DataFrame. In this new DataFrame, the Embarked column still has NaNs:

titanic_df.fillna(titanic_df["Embarked"].mode())['Embarked'].isnull().sum()
Out: 2

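But you did not assign it back to the whole DataFrame. You assigned this DataFrame to a Series, titanic_df['Embarked']. That did not actually fill the missing values in the Embarked column; it just used the index values of the DataFrame. If you check the new column, you will see the numbers 1, 2, ... instead of S, C and Q.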

What you should do

You are trying to fill a single column with a single value. First, disassociate that value from its label:

titanic_df['Embarked'].mode()[0]
Out: 'S'

Now it does not matter whether you use inplace=True or assign the result back. Both

titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])

and

titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0], inplace=True)

will fill the missing values in the Embarked column with S.
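Since fillna works on labels, another option is to hand it a dict keyed by column name and fill both columns in one call. The sketch below uses a made-up three-row stand-in for titanic_df. As far as I know, recent pandas versions warn when an inplace method is called on a selected column, so assigning the result back is probably the safer habit.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real Titanic frame.
titanic_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", np.nan, "S"],
})

# One scalar per column label: the mean for Age, the first mode for Embarked.
fill_values = {
    "Age": titanic_df["Age"].mean(),
    "Embarked": titanic_df["Embarked"].mode()[0],
}

# Assign the result back instead of relying on inplace=True.
titanic_df = titanic_df.fillna(fill_values)

# Count what is still missing in each column.
remaining = titanic_df.isna().sum()
print(remaining)
```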

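Of course, this assumes that you want to use the first value if there are multiple modes. You may need to improve your algorithm there (for example, randomly select from the values if there are multiple modes).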
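One possible way to do the random selection is sketched below. The helper name `pick_mode`, the toy data and the seed are all made up for illustration; this is not something pandas provides.

```python
import numpy as np
import pandas as pd

def pick_mode(series, rng):
    """Return one of the modes of `series`, chosen uniformly at random."""
    modes = series.mode()  # NaN is ignored by default
    return modes.iloc[rng.integers(len(modes))]

# Hypothetical toy data: "S" and "C" tie, and one value is missing.
ports = pd.Series(["S", "C", "S", "C", np.nan])

rng = np.random.default_rng(0)  # seeded so the run is repeatable
choice = pick_mode(ports, rng)

# `choice` is a plain scalar, so every NaN is filled with it.
filled = ports.fillna(choice)
print(choice)
print(filled)
```

Passing the generator in explicitly keeps the choice reproducible when needed.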
