Assignment with both fillna() and loc() apparently not working

Question

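I've looked everywhere for an answer but couldn't find one.

My goal: I'm trying to fill in some missing values in a DataFrame, using supervised learning to decide how to fill them.

My code looks like this. Note: the first part isn't important, it only provides context.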

import pandas as pd
from sklearn import neighbors

train_df = df[df['my_column'].notna()]     # train the model without the rows that have missing data
train_x = train_df[['lat', 'long']]        # lat and long are the inputs
train_y = train_df[['my_column']]          # my_column is the output
clf = neighbors.KNeighborsClassifier(2)
clf.fit(train_x, train_y)                  # clf is the classifier; here we train it
df_x = df[['lat', 'long']]                 # inputs for the prediction (every row of df)
prediction = clf.predict(df_x)             # clf.predict() returns an array
series_pred = pd.Series(prediction)        # now the array is a Series
print(series_pred.shape)                   # RETURNS (2381,)
print(series_pred.isna().sum())            # RETURNS 0

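So far, so good. I have 2381 predictions (I only need a few of them) and there are no NaN values among them (why would a prediction contain NaN? I just wanted to be sure, because I can't see my mistake).

Here I try to assign the predictions to my DataFrame: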

# test_1
df.loc[df['my_column'].isna(), 'my_column'] = series_pred   # assign the predictions using .loc
# test_2
df['my_column'] = df['my_column'].fillna(series_pred)       # double check: assign the predictions using .fillna()
print(df['my_column'].shape)                                # RETURNS (2381,)
print(df['my_column'].isna().sum())                         # RETURNS 6

As you can see, it doesn't work: there are still 6 missing values. I tried a slightly different approach at random:

# test_3
df[['my_column']] = df[['my_column']].fillna(series_pred)   # will it work?
print(df[['my_column']].shape)                              # RETURNS (2381, 1)
print(df[['my_column']].isna().sum())                       # RETURNS 6

No luck. I decided to try one last thing: checking the result of fillna before even assigning it back to the original df:

In[42]:
print(df['my_column'].fillna(series_pred).isna().sum())  # extreme test
Out[42]:
6

So... where is my very, very stupid mistake? Thanks a lot.

EDIT 1

To show a bit of the data:

In[1]:
df.head()
Out[1]:
      my_column      lat    long
id
9df   Wil            51     5
4f3   Fabio          47     9
x32   Fabio          47     8
z6f   Fabio          47     9
a6f   Giovanni       47     7

I also added some information at the beginning of the question.

Answer 1

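@Ben.T or @Dan should post their own answer, and it should be accepted as the correct one.

Based on their hints, I'd say there are two solutions:

Solution 1 (best): use loc()

The problem

The problem with the current approach is that df.loc[df['my_column'].isna(), 'my_column'] expects to receive X values, where X is the number of missing values. My variable prediction actually holds predictions for both the missing and the non-missing rows.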
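To see the failure in isolation, here is a minimal sketch on a toy frame. The data and the hard-coded prediction array are made up for illustration and stand in for clf.predict(). One detail worth noting: when the right-hand side is a Series, pandas matches it to the target rows by index label, so a Series with the default 0, 1, 2, ... index finds no matching labels in a frame indexed by id strings.

```python
import numpy as np
import pandas as pd

# Toy frame with a string index, like the `id` index in the question.
df = pd.DataFrame(
    {
        "my_column": ["Wil", np.nan, "Fabio", np.nan],
        "lat": [51, 47, 47, 47],
        "long": [5, 9, 8, 7],
    },
    index=["9df", "4f3", "x32", "a6f"],
)

# Stand-in for clf.predict(df[['lat', 'long']]): one label per row of df.
prediction = np.array(["Wil", "Fabio", "Fabio", "Giovanni"])
series_pred = pd.Series(prediction)  # default index 0, 1, 2, 3

mask = df["my_column"].isna()

# Work on a copy so the toy frame stays untouched.
attempt = df.copy()
attempt.loc[mask, "my_column"] = series_pred  # aligned on labels, none match

remaining = int(attempt["my_column"].isna().sum())
print(remaining)  # the missing values are still there
```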

The fix

pred_df = df[df['my_column'].isna()]        # for the prediction, use a DataFrame with only the missing rows
df_x = pred_df[['lat', 'long']]
prediction = clf.predict(df_x)              # one prediction per missing row
df.loc[df['my_column'].isna(), 'my_column'] = prediction
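For anyone who wants to run this end to end without a fitted clf, here is a self-contained sketch of the same pattern. The toy data and the predict_nearest helper are my own illustration: it is a plain 1-nearest-neighbour lookup in NumPy, used only as a stand-in for clf.predict(), not the KNeighborsClassifier from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "my_column": ["Wil", np.nan, "Fabio", "Giovanni", np.nan],
        "lat": [51.0, 47.0, 47.0, 47.0, 51.0],
        "long": [5.0, 8.2, 8.0, 7.0, 5.1],
    },
    index=["9df", "4f3", "x32", "a6f", "b7c"],
)


def predict_nearest(train_x, train_y, query_x):
    """Hypothetical stand-in for clf.predict: label of the closest training row."""
    # Pairwise squared distances, shape (n_query, n_train).
    diff = query_x[:, None, :] - train_x[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    return train_y[dist.argmin(axis=1)]


mask = df["my_column"].isna()
train = df[~mask]

# Predict only for the rows that are actually missing.
prediction = predict_nearest(
    train[["lat", "long"]].to_numpy(),
    train["my_column"].to_numpy(),
    df.loc[mask, ["lat", "long"]].to_numpy(),
)

# prediction is a plain array with exactly mask.sum() entries, so it is
# assigned by position and no index alignment is involved.
df.loc[mask, "my_column"] = prediction
print(df)
```

Computing the mask once and reusing it keeps the rows you predict for and the rows you assign to in sync.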

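Solution 2: use fillna()

The problem

The problem with the current approach is that df['my_column'].fillna(series_pred) requires the index of df to match the index of series_pred, which is not the case here unless df has a simple index such as [0, 1, 2, 3, 4...].

The fix

Reset the index of df at the beginning of the code.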
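A minimal sketch of the reset, again on made-up data with a hard-coded Series standing in for the predictions. After reset_index() the id labels become an ordinary column and df gets the same 0, 1, 2, ... index that pd.Series(prediction) has by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"my_column": ["Wil", np.nan, "Fabio", np.nan]},
    index=pd.Index(["9df", "4f3", "x32", "a6f"], name="id"),
)

# Move the `id` labels into a regular column; the index becomes 0, 1, 2, 3.
df = df.reset_index()

# Stand-in for pd.Series(clf.predict(...)) computed on every row of df.
series_pred = pd.Series(["Wil", "Fabio", "Fabio", "Giovanni"])

# Both objects now share the index 0..3, so fillna can match rows.
df["my_column"] = df["my_column"].fillna(series_pred)
filled = df["my_column"].tolist()
print(df)
```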

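Why this is not the best

The cleanest approach is to predict only where it is needed. That is easy to do with loc(); I don't know how you would do it with fillna(), because you would need to preserve the index through the classification.

Edit: series_pred.index = df['my_column'].isna().index (thanks to @Dan)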
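A sketch of that idea on toy data, with a hard-coded array standing in for clf.predict(). Since isna() returns a Series with the same index as df, I pass df.index directly when building the Series, which should be equivalent to the assignment above and leaves the original id index in place.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"my_column": ["Wil", np.nan, "Fabio", np.nan]},
    index=["9df", "4f3", "x32", "a6f"],
)

# Stand-in for clf.predict(df[['lat', 'long']]): one label per row of df.
prediction = np.array(["Wil", "Fabio", "Fabio", "Giovanni"])

# Label each prediction with the row it was computed for.
series_pred = pd.Series(prediction, index=df.index)

# The indexes match, so only the missing entries are filled.
df["my_column"] = df["my_column"].fillna(series_pred)
print(df)
```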
