Best way to fill NaN values in multiple columns at the same time, using specific columns of the same dataframe as reference

Example:

import numpy as np
import pandas as pd

DF = pd.DataFrame({'A': [0, 0, np.nan, 0     , np.nan, 0     , 0     , 0     ],
                   'B': [1, 1, np.nan, 1     , np.nan, 1     , 1     , 1     ],
                   'C': [8, 8, np.nan, 8     , np.nan, np.nan, 8     , 8     ],
                   'D': [2, 2, 2     , np.nan, np.nan, 2     , np.nan, np.nan],
                   'E': [3, 3, 3     , np.nan, np.nan, 3     , np.nan, np.nan]})

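The result I expect is for columns A and B to be filled wherever possible, that is:

   1) If a row of DF['A'] is NaN, it should take the value from the corresponding row of DF['D']
   2) If a row of DF['B'] is NaN, it should take the value from the corresponding row of DF['E']
   3) DF['C'] should remain as it is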

This is what I am trying:

DF[['A', 'B']] = DF[['A','B']].fillna(DF[['D','E']])

But it seems to work only when there are two different dataframes with the same column names. I could split DF into DF1 and DF2, rename DF2['D'] to A and DF2['E'] to B, and then do:

DF1[['A', 'B']] = DF1[['A','B']].fillna(DF2[['A','B']])

But I don't think that is the best way. Any ideas?

The actual dataset has 3 million rows, so it would be good to get the most efficient solution :)

Thanks!! :)

Using np.where is a good option, because it works on the underlying numpy arrays:

DF[['A','B']] = np.where(DF[['A','B']].isna(), DF[['D','E']], DF[['A','B']])

Output:

     A    B    C    D    E
0  0.0  1.0  8.0  2.0  3.0
1  0.0  1.0  8.0  2.0  3.0
2  2.0  3.0  NaN  2.0  3.0
3  0.0  1.0  8.0  NaN  NaN
4  NaN  NaN  NaN  NaN  NaN
5  0.0  1.0  NaN  2.0  3.0
6  0.0  1.0  8.0  NaN  NaN
7  0.0  1.0  8.0  NaN  NaN
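
As for why the original fillna attempt leaves the NaNs in place: fillna aligns on column labels, and ['A', 'B'] shares no labels with ['D', 'E']. Here is a minimal sketch comparing two ways around that, assuming the example DF from the question. The names fallback, via_fillna and via_where are only illustrative, and this is not a benchmark on the 3-million-row data.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({'A': [0, 0, np.nan, 0, np.nan, 0, 0, 0],
                   'B': [1, 1, np.nan, 1, np.nan, 1, 1, 1],
                   'C': [8, 8, np.nan, 8, np.nan, np.nan, 8, 8],
                   'D': [2, 2, 2, np.nan, np.nan, 2, np.nan, np.nan],
                   'E': [3, 3, 3, np.nan, np.nan, 3, np.nan, np.nan]})

# Option 1: stay in pandas. Relabel D/E as A/B so that fillna's
# label alignment pairs A with D and B with E. No need to split DF.
fallback = DF[['D', 'E']].rename(columns={'D': 'A', 'E': 'B'})
via_fillna = DF[['A', 'B']].fillna(fallback)

# Option 2: work on the raw arrays, where values are paired by
# position and column labels play no role.
target = DF[['A', 'B']].to_numpy()
source = DF[['D', 'E']].to_numpy()
via_where = pd.DataFrame(np.where(np.isnan(target), source, target),
                         index=DF.index, columns=['A', 'B'])

# Write the result back; C, D and E are not touched.
DF[['A', 'B']] = via_where
```

Both options should give the same A and B columns as in the output above. Which one is faster on your data is something to measure with timeit rather than assume.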