Accuracy rate for kNN classification dropped after feature normalization?
I'm running kNN classification on some data, randomly split 80/20 into training and test sets.
My data looks like this:
[ [1.0, 1.52101, 13.64, 4.49, 1.1, 71.78, 0.06, 8.75, 0.0, 0.0, 1.0],
[2.0, 1.51761, 13.89, 3.6, 1.36, 72.73, 0.48, 7.83, 0.0, 0.0, 2.0],
[3.0, 1.51618, 13.53, 3.55, 1.54, 72.99, 0.39, 7.78, 0.0, 0.0, 3.0],
...
]
The items in the last column of the matrix are the classes: 1.0, 2.0, and 3.0.
After feature normalization my data looks like this:
[[-0.5036443480260487, -0.03450760227559746, 0.06723230162846759, 0.23028986544844693, -0.025324623254270005, 0.010553065215338569, 0.0015136367098358505, -0.11291235596166802, -0.05819669234942126, -0.12069793876044387, 1.0],
[-0.4989050339943617, -0.11566537753097901, 0.010637426608816412, 0.2175704556290625, 0.03073267976659575, 0.05764598316498372, -0.012976783512350588, -0.11815839520204152, -0.05819669234942126, -0.12069793876044387, 2.0],
...
]
The formula I used for normalization:
(X - avg(X)) / (max(X) - min(X))
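For reference, a minimal sketch of this normalization in Python/NumPy, assuming the data sits in a 2D array whose last column is the class label (left unscaled) and no feature column is constant:

```python
import numpy as np

def normalize_features(data):
    """Mean normalization, column-wise: (X - avg(X)) / (max(X) - min(X)).

    The last column is assumed to hold the class label and is left as-is.
    """
    X = data[:, :-1].astype(float)
    spread = X.max(axis=0) - X.min(axis=0)   # max(X) - min(X) per column
    X_norm = (X - X.mean(axis=0)) / spread
    return np.column_stack([X_norm, data[:, -1]])
```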
I run kNN classification 100 times for each odd K from 1 to 25 and record the average accuracy for each K used.
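My own implementation is not shown here; as a rough equivalent of this evaluation loop, a sketch using scikit-learn (the variable `data` stands for the matrix above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def average_accuracy(data, k, n_runs=100):
    """Mean kNN test accuracy over n_runs random 80/20 splits."""
    X, y = data[:, :-1], data[:, -1]
    scores = []
    for _ in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return 100.0 * np.mean(scores)

for k in range(1, 26, 2):   # odd K from 1 to 25
    print(f"Average accuracy for K={k}: {average_accuracy(data, k)} %")
```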
Here are my results:
Average accuracy for K=1 after 100 tests with different data split: 98.91313003886198 %
Average accuracy for K=3 after 100 tests with different data split: 98.11976006170633 %
Average accuracy for K=5 after 100 tests with different data split: 97.71226079929019 %
Average accuracy for K=7 after 100 tests with different data split: 97.47493145754373 %
Average accuracy for K=9 after 100 tests with different data split: 97.16596220947888 %
Average accuracy for K=11 after 100 tests with different data split: 96.81465365733266 %
Average accuracy for K=13 after 100 tests with different data split: 95.78772655522567 %
Average accuracy for K=15 after 100 tests with different data split: 95.23116406332706 %
Average accuracy for K=17 after 100 tests with different data split: 94.52371789094929 %
Average accuracy for K=19 after 100 tests with different data split: 93.85285871435981 %
Average accuracy for K=21 after 100 tests with different data split: 93.26620809747965 %
Average accuracy for K=23 after 100 tests with different data split: 92.58047022661833 %
Average accuracy for K=25 after 100 tests with different data split: 90.55746523509124 %
But when I apply feature normalization, the accuracy drops significantly.
My kNN results with normalized features:
Average accuracy for K=1 after 100 tests with different data split: 88.56128075154439 %
Average accuracy for K=3 after 100 tests with different data split: 85.01466511662318 %
Average accuracy for K=5 after 100 tests with different data split: 83.32096281613967 %
Average accuracy for K=7 after 100 tests with different data split: 83.09434478900455 %
Average accuracy for K=9 after 100 tests with different data split: 82.05628926919964 %
Average accuracy for K=11 after 100 tests with different data split: 79.89732262550343 %
Average accuracy for K=13 after 100 tests with different data split: 79.60617886853211 %
Average accuracy for K=15 after 100 tests with different data split: 79.26511126374507 %
Average accuracy for K=17 after 100 tests with different data split: 77.51457877706329 %
Average accuracy for K=19 after 100 tests with different data split: 76.97848441605367 %
Average accuracy for K=21 after 100 tests with different data split: 75.70005919265326 %
Average accuracy for K=23 after 100 tests with different data split: 76.45758217099551 %
Average accuracy for K=25 after 100 tests with different data split: 76.16619492431572 %
There is no logical error in my algorithm; I checked it on simple data.
Why does kNN classification accuracy drop so much after feature normalization? I assumed that normalization by itself should not reduce the accuracy of any classifier. So what is the purpose of feature normalization?
kNN works by finding instances similar to a given point, computing the Euclidean distance between pairs of points. By normalizing you change the scale of the features, and with it those distances, which is what changed your accuracy.
Take a look at this study: go to the figures and you will find that different scaling techniques give different accuracies.
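As a small illustration of that scale effect (hypothetical numbers, not the question's data): with two features on very different scales, rescaling one of them flips which candidate is the nearest neighbor:

```python
import numpy as np

q = np.array([0.9, 100.0])    # query point
a = np.array([0.1, 110.0])    # candidate neighbor A
b = np.array([0.9, 300.0])    # candidate neighbor B

def dist(u, v):
    return np.linalg.norm(u - v)   # Euclidean distance

# Raw data: feature 2 has a much larger scale and dominates, so A wins.
print(dist(q, a), dist(q, b))                  # ~10.0 vs 200.0

# After dividing feature 2 by its (assumed) range of 1000, feature 1
# dominates instead and B becomes the nearest neighbor.
s = np.array([1.0, 1000.0])
print(dist(q / s, a / s), dist(q / s, b / s))  # ~0.80 vs 0.20
```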
It is a common misconception that normalization never reduces classification accuracy. It perfectly well can.
How?
The relative values within a row matter too; they are what determine a point's position in feature space. Normalization can seriously shift that relative position. This is felt especially in kNN classification, because it operates directly on the distances between points. The effect is less pronounced in SVM, by contrast, because there the optimization process can still find a reasonably accurate hyperplane.
Also note that you normalize using avg(X). Consider two values in adjacent columns of a particular row: if the first lies far below its column's mean and the second far above its column's mean, then even though they are numerically very close in the raw data, the distance computations involving them can come out very different.
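A tiny numeric example of this (hypothetical values): two entries of the same row that are almost equal in the raw data land on opposite sides of zero after mean normalization, because their columns have very different means:

```python
import numpy as np

# Row 0 holds 50.0 and 52.0: nearly equal raw values, but 50.0 is far
# below its column's mean (~76) and 52.0 far above its column's mean (~24).
col1 = np.array([50.0, 80.0, 85.0, 90.0])
col2 = np.array([52.0, 20.0, 15.0, 10.0])

def normalize(col):
    return (col - col.mean()) / (col.max() - col.min())

print(col1[0], col2[0])                        # 50.0 vs 52.0
print(normalize(col1)[0], normalize(col2)[0])  # ~-0.66 vs ~+0.66
```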
Never expect normalization to work wonders.