Why the NMI value is small while having higher clustering accuracy and Rand index in clustering

Question

I am using https://www.mathworks.com/matlabcentral/fileexchange/32197-clustering-results-measurement to evaluate my clustering accuracy in MATLAB. It provides accuracy and rand_index, and those results look normal, as expected. However, when I use NMI as a metric, the clustering performance is extremely low. For NMI I am using this source code: https://www.mathworks.com/matlabcentral/fileexchange/29047-normalized-mutual-information.

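I have two Nx1 vectors as input: one holds the actual labels and the other holds the assigned cluster labels. I checked every element in them, and I found that even though I get 82% accuracy and a rand_index of about 0.70, the NMI is only 0.3209. Below is an example on the iris dataset https://archive.ics.uci.edu/ml/datasets/iris using MATLAB's built-in K-Means.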

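% iris is assumed to be a numeric matrix: feature columns first, class label last
data_dim = size(iris,2) - 1;   % assumption: data_dim was not defined in the original post
data = iris(:,1:data_dim);
k = 3;
[result_label,centroid] = kmeans(data,k,'MaxIter',10000);
actual_label = iris(:,end);

% nmi and AccMeasure come from the two File Exchange submissions linked above
NMI = nmi(actual_label,result_label);
[Acc,rand_index,match] = AccMeasure(actual_label',result_label');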

Result:

Auto ACC: 0.820000 Rand_Index: 0.701818 NMI: 0.320912

Answer 1

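The Rand index is not corrected for chance: even when comparing random clusterings it stays well above 0, and it tends toward 1 as the number of clusters grows, so you never really expect to see small Rand values.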
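A minimal sketch of how the Rand index behaves on random clusterings, assuming two independent, uniformly random labelings with k = 3 clusters (all variable names are illustrative and not from the question). For such labelings the expected Rand index is about 1/k^2 + (1 - 1/k)^2, roughly 0.56 for k = 3, while the NMI should sit near 0. The normalization used here, MI / sqrt(H(a) * H(b)), is one common choice and may differ from what the File Exchange nmi function does.

```matlab
% Sketch: Rand index vs. NMI for two unrelated random labelings.
% n, k, a, b and the helper names below are illustrative assumptions.
n = 3000;  k = 3;
a = randi(k, n, 1);                % random "actual" labels
b = randi(k, n, 1);                % independent random "cluster" labels

% Contingency table: C(i,j) = number of points with a == i and b == j
C = accumarray([a b], 1, [k k]);

% Rand index from pair counts
pairs    = @(m) m .* (m - 1) / 2;  % number of unordered pairs among m items
total    = pairs(n);
sameBoth = sum(pairs(C(:)));       % pairs grouped together in both labelings
sameA    = sum(pairs(sum(C, 2)));  % pairs grouped together in a
sameB    = sum(pairs(sum(C, 1)));  % pairs grouped together in b
randIdx  = (total + 2*sameBoth - sameA - sameB) / total;

% Normalized mutual information from the same table
P  = C / n;                        % joint distribution of (a, b)
pa = sum(P, 2);                    % marginal of a (column vector)
pb = sum(P, 1);                    % marginal of b (row vector)
E  = pa * pb;                      % product of marginals, k-by-k
nz = P > 0;                        % skip empty cells (0 * log 0 is taken as 0)
MI = sum(P(nz) .* log(P(nz) ./ E(nz)));
H  = @(p) -sum(p(p > 0) .* log(p(p > 0)));   % Shannon entropy
nmiVal = MI / sqrt(H(pa) * H(pb));

fprintf('Rand index: %.3f   NMI: %.4f\n', randIdx, nmiVal);
```

The exact numbers change slightly from run to run because no random seed is fixed, but the gap between the two scores is the point: a Rand index near 0.56 here means "no better than chance", not "56% correct".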

At the same time, accuracy will be high when all of your points fall into the same large cluster.
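As a rough illustration of the single-cluster case, the sketch below uses a hypothetical imbalanced data set (90 points of one class, 10 of another) and a degenerate clustering that assigns every point to cluster 1. The data and variable names are made up for this example. Accuracy and Rand index both come out high, while the mutual information between the two labelings is zero.

```matlab
% Sketch: one dominant class and a clustering with a single cluster.
% The 90/10 split is a hypothetical example, not data from the question.
actual = [ones(90, 1); 2 * ones(10, 1)];   % imbalanced ground truth
pred   = ones(100, 1);                     % every point assigned to cluster 1

n = numel(actual);
C = accumarray([actual pred], 1);          % 2-by-1 contingency table [90; 10]

% With one cluster, the best cluster-to-class match is the largest class
acc = max(C(:)) / n;

% Rand index from pair counts
pairs    = @(m) m .* (m - 1) / 2;
total    = pairs(n);
sameBoth = sum(pairs(C(:)));
sameA    = sum(pairs(sum(C, 2)));
sameB    = sum(pairs(sum(C, 1)));
randIdx  = (total + 2*sameBoth - sameA - sameB) / total;

% Mutual information between the two labelings
P  = C / n;
E  = sum(P, 2) * sum(P, 1);                % product of marginals
nz = P > 0;
MI = sum(P(nz) .* log(P(nz) ./ E(nz)));

fprintf('Accuracy: %.2f   Rand index: %.3f   MI: %.3f\n', acc, randIdx, MI);
```

Because the predicted labeling has zero entropy, a normalized score of the form MI / sqrt(H(a) * H(b)) is 0/0 here; implementations usually report 0 in that case, which is an assumption worth checking for whichever NMI function you use.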

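My feeling is that NMI is giving the more reliable comparison. To verify, try running a dimensionality reduction and plotting the data points colored according to each of the two clusterings. Visual checks are usually the best way to build intuition about your data.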
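One way to do that check, sketched under the assumption that the Statistics and Machine Learning Toolbox is available (it ships the fisheriris sample data, grp2idx, and kmeans). PCA is done with a plain SVD here; any other dimensionality reduction would serve the same purpose.

```matlab
% Sketch: project iris to 2-D with PCA and color points by each labeling.
% Uses the toolbox sample data instead of the UCI file from the question.
load fisheriris                            % meas: 150x4 features, species: 150x1 cell
actual_label = grp2idx(species);           % numeric class labels 1..3
result_label = kmeans(meas, 3, 'MaxIter', 10000);

X = meas - mean(meas, 1);                  % center each feature column
[U, S, ~] = svd(X, 'econ');                % PCA via singular value decomposition
score = U(:, 1:2) * S(1:2, 1:2);           % coordinates on the first two components

figure;
subplot(1, 2, 1);
scatter(score(:, 1), score(:, 2), 25, actual_label, 'filled');
title('Actual labels');   xlabel('PC 1');  ylabel('PC 2');

subplot(1, 2, 2);
scatter(score(:, 1), score(:, 2), 25, result_label, 'filled');
title('K-means labels');  xlabel('PC 1');  ylabel('PC 2');
```

Cluster numbers from kmeans are arbitrary, so the same group may appear in a different color in each panel; compare the shapes of the groups rather than the colors.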

If you want to explore further, a handy Python package for comparing clusterings is CluSim.

matlab

cluster-analysis

k-means

nmi