Is the nearest centroid classifier really inefficient?

I am currently reading Ethem Alpaydin's "Introduction to Machine Learning", where I came across the nearest centroid classifier and tried to implement it. I think I have implemented the classifier correctly, but I am only getting 68% accuracy. So, is the nearest centroid classifier itself that poor, or is there some mistake in my implementation (below)?

The dataset contains 1372 data points, each with 4 features, and there are 2 output classes. My MATLAB implementation:

    DATA = load("-ascii", "data.txt");

    #DATA is 1372x5 matrix with 762 data points of class 0 and 610 data points of class 1
    #there are 4 features of each data point
    X = DATA(:,1:4); #matrix to store all features

    X0 = DATA(1:762,1:4); #matrix to store the features of class 0
    X1 = DATA(763:1372,1:4); #matrix to store the features of class 1
    X0 = X0(1:610,:); #to make sure both datasets have same size for prior probability to be equal
    Y = DATA(:,5); #to store outputs

    mean0 = sum(X0)/610; #mean of features of class 0
    mean1 = sum(X1)/610; #mean of features of class 1

    count = 0;
    for i = 1:1372
      pre = 0;
      cost1 = X(i,:)*(mean0'); #calculates the dot product of the data point with the mean of each class
      cost2 = X(i,:)*(mean1');

      if (cost1 < cost2)
        pre = 1;
      end
      if pre == Y(i)
        count = count + 1; #counts the number of correctly predicted values
      end
    end

    disp("accuracy");
    disp((count/1372)*100); #prints the accuracy as a percentage

There are at least a few points here:

  1. You are using the dot product to measure similarity in the input space, and this almost never works. The only justification for using the dot product is the assumption that all your data points have the same norm, or that the norm does not matter (which is almost never true). Try using the Euclidean distance instead; even though it is very naive, it should work noticeably better (see the first sketch after this list).

  2. Is it an inefficient classifier? That depends on the definition of efficiency: it is an extremely simple and fast method, but in terms of predictive power it is very poor. It is actually worse than Naive Bayes, which is already considered a "toy model".

  3. There are also problems in the code:

    X0 = DATA(1:762,1:4); #matrix to store the features of class 0
    X1 = DATA(763:1372,1:4); #matrix to store the features of class 1
    X0 = X0(1:610,:); #to make sure both datasets have same size for prior probability to be equal 
    

    Once you sub-sample X0 you have 1220 training samples, yet later, during "testing", you test on both the training set and the "missing elements of X0", which does not really make sense from a probabilistic point of view. First of all, you should never test accuracy on the training set (because it overestimates the true accuracy), and second, by sub-sampling your training data you are not equalizing the priors. Not with a method like this; you are only degrading the quality of your centroid estimates, nothing more. Sub-/over-sampling techniques equalize the priors for models that actually model the priors. Your method does not (it is essentially a generative model with an assumed prior of 1/2), so nothing good can come of it. If you want a realistic accuracy estimate, hold out part of the data for testing, as in the second sketch below.
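
As a rough illustration of point 1, here is an Octave sketch (not your exact setup) of a nearest centroid classifier that computes each centroid from the labels and uses the squared Euclidean distance as the decision rule. It assumes the same 1372x5 data.txt layout as in your code and, purely for comparability, still evaluates on all the data, which point 3 explains is optimistic:

    # Sketch: nearest centroid with a Euclidean decision rule,
    # assuming the same 1372x5 data.txt layout as in the question.
    DATA = load("-ascii", "data.txt");
    X = DATA(:,1:4);               # features
    Y = DATA(:,5);                 # class labels (0 or 1)

    mean0 = mean(X(Y == 0, :));    # centroid of class 0
    mean1 = mean(X(Y == 1, :));    # centroid of class 1

    # squared Euclidean distance of every point to each centroid
    # (relies on automatic broadcasting, available in Octave and recent MATLAB)
    d0 = sum((X - mean0).^2, 2);
    d1 = sum((X - mean1).^2, 2);

    pred = (d1 < d0);              # predict class 1 when its centroid is closer
    disp("accuracy");
    disp(mean(pred == Y)*100);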
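
For point 3, here is a minimal sketch of a hold-out evaluation instead of testing on the training data. The 70/30 random split is an illustrative assumption of mine, not something dictated by your problem; cross-validation would be the more thorough option:

    # Sketch: estimate accuracy on a held-out test set (Octave).
    # The 70/30 random split is an illustrative assumption.
    DATA = load("-ascii", "data.txt");
    n = size(DATA, 1);
    idx = randperm(n);                     # shuffle the rows
    ntrain = round(0.7*n);
    Dtr = DATA(idx(1:ntrain), :);          # training part
    Dte = DATA(idx(ntrain+1:end), :);      # held-out test part

    Xtr = Dtr(:,1:4);  Ytr = Dtr(:,5);
    Xte = Dte(:,1:4);  Yte = Dte(:,5);

    # centroids estimated on the training part only, with no sub-sampling
    mean0 = mean(Xtr(Ytr == 0, :));
    mean1 = mean(Xtr(Ytr == 1, :));

    # Euclidean-distance rule, evaluated on the held-out part
    d0 = sum((Xte - mean0).^2, 2);
    d1 = sum((Xte - mean1).^2, 2);
    pred = (d1 < d0);

    disp("held-out accuracy");
    disp(mean(pred == Yte)*100);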