Demonstrate threshold h2o kmeans

I am using h2o kmeans in R to segment my population. The method has to be audited, so I would like to be able to explain the threshold used in h2o's kmeans.

The documentation for h2o kmeans (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html) says:

H2O uses proportional reduction in error (PRE) to determine when to stop splitting. The PRE value is calculated based on the sum of squares within (SSW).

PRE=(SSW[before split]−SSW[after split])/SSW[before split]

H2O stops splitting when PRE falls below a threshold, which is a function of the number of variables and the number of cases as described below:

threshold takes the smaller of these two values:

either 0.8 or [0.02 + 10/number_of_training_rows + 2.5/(number_of_model_features)^2]
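
To make the quoted PRE formula concrete, here is a minimal sketch in plain Python. It is not H2O's implementation: the helper names `ssw` and `pre` and the toy data are made up for illustration, and it assumes SSW means the sum of squared Euclidean distances from each point to its cluster centroid.

```python
def ssw(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid = per-coordinate mean of the cluster members.
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


def pre(ssw_before, ssw_after):
    """Proportional reduction in error, as in the quoted formula."""
    return (ssw_before - ssw_after) / ssw_before


# Toy data (hypothetical): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
before = ssw(points, [0, 0, 0, 0])  # everything in one cluster
after = ssw(points, [0, 0, 1, 1])   # split into two clusters
print(before, after, pre(before, after))
```

On this toy data the split removes almost all of the within-cluster sum of squares, so PRE is close to 1. A split that barely reduces SSW gives a PRE close to 0, which is the situation the threshold is meant to catch.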

The source code (https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/main/java/hex/kmeans/KMeans.java) gives it as:

final double rel_improvement_cutoff = Math.min(0.02 + 10. / _train.numRows() + 2.5 / Math.pow(model._output.nfeatures(), 2), 0.8);
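
For an audit it can help to see how this cutoff behaves for concrete sizes. The following is a small Python transcription of the Java expression above, offered as a sketch only. The function and argument names are mine, with `n_rows` standing in for `_train.numRows()` and `n_features` for `model._output.nfeatures()`.

```python
def rel_improvement_cutoff(n_rows, n_features):
    """Python mirror of the Java line above (illustrative, not H2O code)."""
    return min(0.02 + 10.0 / n_rows + 2.5 / n_features ** 2, 0.8)


# 1000 rows, 10 features: 0.02 + 0.01 + 0.025
cutoff_wide = rel_improvement_cutoff(1000, 10)

# 1000 rows, 1 feature: the formula gives 2.53, so the 0.8 cap applies
cutoff_single = rel_improvement_cutoff(1000, 1)

print(cutoff_wide, cutoff_single)
```

With many rows the `10 / n_rows` term vanishes and the cutoff approaches `0.02 + 2.5 / n_features^2`, so the number of features dominates. With very few features the cap of 0.8 takes over.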

Where does this threshold come from? Is there a scientific paper about it?

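I'm responsible for that threshold. I developed it by running a large number of datasets (both artificial and real) through the k-means algorithm. Some years ago I started out by looking at the improvement in SSW and testing it as a chi-square variable, as recommended by John Hartigan. That criterion failed in many cases, so I switched to PRE. The equation above is the result of fitting a nonlinear model to the results on datasets with known numbers of clusters. When I wrote the k-means program for Tableau, I used the same PRE criterion. After I left Tableau for H2O, they replaced my PRE rule with the Calinski-Harabasz index, which produces similar results.

Leland Wilkinson, Chief Scientist, H2O.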