"Addressing missing data" 如何帮助 KNN 更好地发挥作用?

How does "Addressing missing data" help KNN function better?

Source: https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/

That page has a section containing the following passage:

Best Prepare Data for KNN

Rescale Data: KNN performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution.

Address Missing Data: Missing data will mean that the distance between samples cannot be calculated. These samples could be excluded or the missing values could be imputed.

Lower Dimensionality: KNN is suited for lower dimensional data. You can try it on high dimensional data (hundreds or thousands of input variables) but be aware that it may not perform as well as other techniques. KNN can benefit from feature selection that reduces the dimensionality of the input feature space.

Could someone explain the second point, "Address Missing Data", in more detail?

In this context, missing data means that some samples do not have values for all of the features.

For example:

Suppose you have a database containing the age and height of a group of people. Missing data means that, for some of those people, either the height or the age is not recorded.
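A minimal sketch of such a table, with hypothetical values and pandas' NaN marking the missing entry:

```python
import numpy as np
import pandas as pd

# Hypothetical age/height records; the second person's height is unknown.
people = pd.DataFrame(
    {
        "age": [25, 31, 47, 52],
        "height": [175.0, np.nan, 168.0, 181.0],
    }
)
print(people)
```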

Now, why does this affect KNN?

Given a test sample, KNN finds the samples closest to it (i.e., the people with a similar age and height). KNN does this so that it can make an inference about the test sample based on its nearest neighbors.

To find those neighbors, you must be able to compute the distance between samples, and to compute the distance between two samples you need all of the features of both of them.
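For instance, with the hypothetical rows above, a plain Euclidean distance involving the incomplete sample just propagates NaN, so that neighbor can never be ranked:

```python
import numpy as np

a = np.array([25.0, 175.0])   # complete sample: (age, height)
b = np.array([31.0, np.nan])  # sample whose height is missing

# Any arithmetic with NaN yields NaN, so the distance is undefined.
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)  # nan
```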

If some of those features are missing, you cannot compute the distance, so the samples with missing data are implicitly thrown away. That is why such samples should either be excluded explicitly or have their missing values imputed.
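A rough sketch of the two options from the quoted passage, assuming scikit-learn is available: either drop the incomplete rows, or impute the missing values (here with a simple column-mean strategy) so that every pairwise distance can be computed before fitting KNN.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

people = pd.DataFrame(
    {
        "age": [25, 31, 47, 52],
        "height": [175.0, np.nan, 168.0, 181.0],
    }
)

# Option 1: exclude the samples that have missing values.
complete_only = people.dropna()

# Option 2: impute the missing values, e.g. with the column mean,
# so that distances can be computed for every sample.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(people), columns=people.columns)

print(complete_only)
print(imputed)
```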