How to handle missing NaNs for machine learning in Python

Question

How should I handle missing values in a dataset before applying a machine learning algorithm?

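I have noticed that dropping the missing NaN values is not a smart thing to do. I usually use pandas to interpolate (compute the mean) and fill in the data. This works and improves classification accuracy, but it may not be the best approach.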
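The two options mentioned above (dropping versus mean filling) can be sketched in pandas roughly as follows. The toy frame and its column names `a` and `b` are made up for illustration and are not part of the dataset in question.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; column names are hypothetical.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

# Option 1: drop every row that contains at least one NaN.
dropped = df.dropna()

# Option 2: fill each column's NaNs with the mean of that column's
# observed values (df.mean() skips NaN by default).
filled = df.fillna(df.mean())

print(len(df), len(dropped))   # rows before and after dropping
print(filled)
```

Dropping keeps only the fully observed rows, so on a dataset with many sparse columns most of the data is lost, which is the concern raised here.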

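This is a very important question. What is the best way to handle missing values in a dataset?

For example, in the dataset below most columns have original (non-null) data for only about 30% of the rows.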

Int64Index: 7049 entries, 0 to 7048
Data columns (total 31 columns):
left_eye_center_x            7039 non-null float64
left_eye_center_y            7039 non-null float64
right_eye_center_x           7036 non-null float64
right_eye_center_y           7036 non-null float64
left_eye_inner_corner_x      2271 non-null float64
left_eye_inner_corner_y      2271 non-null float64
left_eye_outer_corner_x      2267 non-null float64
left_eye_outer_corner_y      2267 non-null float64
right_eye_inner_corner_x     2268 non-null float64
right_eye_inner_corner_y     2268 non-null float64
right_eye_outer_corner_x     2268 non-null float64
right_eye_outer_corner_y     2268 non-null float64
left_eyebrow_inner_end_x     2270 non-null float64
left_eyebrow_inner_end_y     2270 non-null float64
left_eyebrow_outer_end_x     2225 non-null float64
left_eyebrow_outer_end_y     2225 non-null float64
right_eyebrow_inner_end_x    2270 non-null float64
right_eyebrow_inner_end_y    2270 non-null float64
right_eyebrow_outer_end_x    2236 non-null float64
right_eyebrow_outer_end_y    2236 non-null float64
nose_tip_x                   7049 non-null float64
nose_tip_y                   7049 non-null float64
mouth_left_corner_x          2269 non-null float64
mouth_left_corner_y          2269 non-null float64
mouth_right_corner_x         2270 non-null float64
mouth_right_corner_y         2270 non-null float64
mouth_center_top_lip_x       2275 non-null float64
mouth_center_top_lip_y       2275 non-null float64
mouth_center_bottom_lip_x    7016 non-null float64
mouth_center_bottom_lip_y    7016 non-null float64
Image                        7049 non-null object

Answer 1

What is the best way to handle missing values in data set?

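There is no best way. Each solution/algorithm has its own pros and cons, and you can even mix some of them together to create your own strategy and tune the related parameters to find the one that best fits your data. There is a lot of research and many papers on this topic.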

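For example, mean imputation is quick and simple, but it underestimates the variance, and replacing NaN with the mean distorts the shape of the distribution. KNN imputation, on the other hand, may not be ideal for large datasets in terms of time complexity, since it iterates over all the data points and performs a calculation for each NaN value, and it assumes that the attribute containing the NaN is correlated with the other attributes.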
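The variance claim can be checked with a small experiment. This is only an illustrative sketch on synthetic data: the normal distribution, the 70% missing rate, and the assumption that values are missing completely at random are all choices made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic "true" feature; the distribution parameters are arbitrary.
full = pd.Series(rng.normal(loc=50.0, scale=10.0, size=1000))

# Hide roughly 70% of the values completely at random.
hide = rng.random(full.size) < 0.7
observed = full.mask(hide)          # hidden positions become NaN

# Mean imputation: every NaN is replaced by the observed mean.
imputed = observed.fillna(observed.mean())

var_observed = observed.var()       # NaNs are skipped
var_imputed = imputed.var()

print(f"variance of observed values: {var_observed:.2f}")
print(f"variance after mean filling: {var_imputed:.2f}")
```

The filled values all sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the sample size, which is why the variance after filling comes out smaller.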

How to handle missing values in datasets before applying a machine learning algorithm?

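Besides the mean imputation you mentioned, you could also take a look at K-Nearest Neighbor Imputation and Regression Imputation, and refer to the powerful Imputer class in scikit-learn to see which existing APIs you can use (in recent scikit-learn releases this functionality is provided by SimpleImputer in the sklearn.impute module).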
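As a rough sketch of what the scikit-learn route looks like, assuming a scikit-learn version that ships `sklearn.impute.SimpleImputer`; the small array is made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with one NaN in each column.
X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [7.0, np.nan],
])

# strategy="mean" replaces each NaN with its column mean; "median" and
# "most_frequent" are other available strategies.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Because the imputer is a fitted transformer, the column means learned on the training data can be reused on test data with `imputer.transform`, which avoids leaking test statistics into training.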

KNN Imputation

Compute the mean of the k nearest neighbors of the point that contains the NaN.
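One way to try this is `KNNImputer` from `sklearn.impute`, assuming a scikit-learn version that includes it. The sketch below uses a small made-up matrix and `n_neighbors=2`. Distances are computed on the features that both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix; rows 0 and 2 each have one missing entry.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each NaN is replaced by the mean of that feature over the 2 nearest
# rows that have the feature observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```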

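Regression Imputation

A regression model is estimated to predict the observed values of a variable from the other variables, and that model is then used to impute values in the cases where that variable is missing.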
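A minimal sketch of this idea, assuming a single predictor and a plain linear model; the columns `x` and `y` and their values are hypothetical, and a real dataset would likely need several predictors and a check that the relationship is reasonably linear:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y follows 2 * x + 1 wherever it is observed.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, np.nan, 9.0, np.nan],
})

has_y = df["y"].notna()

# Fit the regression on the rows where y is observed.
model = LinearRegression()
model.fit(df.loc[has_y, ["x"]], df.loc[has_y, "y"])

# Predict y for the rows where it is missing and write the values back.
df.loc[~has_y, "y"] = model.predict(df.loc[~has_y, ["x"]])

print(df)
```

Note that imputed values fall exactly on the fitted line, so this approach, like mean filling, tends to understate the spread of the imputed variable.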

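See the "Imputation of missing values" section of the scikit-learn documentation. I have also heard of the Orange library for imputation, but haven't had a chance to use it yet.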

Answer 2

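There is no single best way to handle missing data. The most rigorous approach is to model the missing values as additional parameters in a probabilistic framework such as PyMC. This way you get a distribution over possible values instead of just a single answer. Here is an example of dealing with missing data using PyMC: http://stronginference.com/missing-data-imputation.html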

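If you really want to plug those holes with point estimates, then you are looking to perform "imputation". I would steer away from simple imputation methods like mean-filling, since they really butcher the joint distribution of your features. Instead, try something like softImpute (which tries to infer the missing values via low-rank approximation). The original version of softImpute is written for R, but I've made a Python version (along with other methods like kNN imputation) here: https://github.com/hammerlab/fancyimpute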
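To give a feel for the low-rank idea, here is a simplified NumPy sketch of iterative soft-thresholded SVD. It is not the fancyimpute or R softImpute implementation; the function name `soft_impute_sketch`, the `shrinkage` value, and the stopping rule are all assumptions made for illustration.

```python
import numpy as np

def soft_impute_sketch(X, shrinkage=0.1, n_iters=200, tol=1e-6):
    """Fill NaNs in X via repeated soft-thresholded SVD (illustrative only)."""
    missing = np.isnan(X)
    # Start from a column-mean fill so the SVD has a complete matrix.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Shrink the singular values toward zero to favor low rank.
        s_shrunk = np.maximum(s - shrinkage, 0.0)
        low_rank = (U * s_shrunk) @ Vt
        # Keep observed entries; take only the missing ones from the
        # low-rank reconstruction.
        updated = np.where(missing, low_rank, X)
        change = np.linalg.norm(updated - filled) / max(np.linalg.norm(filled), 1e-12)
        filled = updated
        if change < tol:
            break
    return filled

# Hypothetical rank-1 matrix with two entries hidden.
truth = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.5, 2.0])
X = truth.copy()
X[1, 2] = np.nan
X[3, 0] = np.nan

completed = soft_impute_sketch(X)
print(completed)
```

Unlike mean filling, each missing entry here is estimated from the structure shared across all rows and columns, which is the property this answer is pointing at.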

python

machine-learning

missing-data

pandas