在 Spark 中使用轮廓聚类

Question

我想在 Spark 中使用 KMeans 聚类时使用剪影来确定 k 的最佳值。有什么最佳的并行化方法吗？即使其可扩展

Answer 1

不，剪影的定义是不可缩放的。

它使用成对距离，这总是需要 O(n^2) 时间来计算。

您需要使用不同的东西。在大数据上使用 Silhouette 是荒谬的，计算评估指标比运行实际的 k-means 聚类算法需要更长的时间。

或者重新考虑一下你在做什么。例如，使用剪影是否有意义？您还可以决定在单个节点上运行比 Spark 更快的东西，在那里计算 Silhouette，然后通过 k 简单地并行化，而不分布式计算的所有开销。 Spark 可能会战胜 MapReduce-Mahout，但它会输给一个好的非分布式实现。

Answer 2

是的，有一些方法可以使剪影指标可扩展。不，它没有像我在这里描述的那样一直发布。它并不那么复杂，因此您也可以理解它并可能编写它。请让我知道，如果你先写的话我可以使用它。

看来我们都需要编写一个高性能的剪影得分器。输入任何聚类列向量，使该评分器能够处理每个聚类实施。如果可能，使用 mapreduce 以获得简单的分布式版本和共享内存。看起来有可能。第 4 页显示数学： http://cran.us.r-project.org/web/packages/clValid/vignettes/clValid.pdf An LSH would help algorithmically since it avoids the exact distance computations that dominate its math. A good LSH implementation would then be essential but I have not found one. Sklearn’s LSHForest is the right idea but not implemented well enough. A simplified silhouette or approximate would be interesting too. The LSH inclusion would result in approximate results. Use LSH capability to find only the nearest point and centroid, which avoids the all-pairs computations. Page 28 of this article has several good suggestions: https://arxiv.org/pdf/1605.01802.pdf 好像在说：使用简化的轮廓而不是简单的轮廓，如下所示：将计算从点到点的距离，更改为点到簇质心的距离。它是从簇内所有点对和最近邻簇的 O(n^2) 减少到线性长度 O(N) 计算。以下是我的理解和翻译：

Start with:
File of cluster tuples: (clusterID, cluster centroid point) 
File of example point tuples: (example point, clusterID). Notice the clusterID is the clusterID column vector described above.
Processing:
For each cluster tuple, run a map(): hashmap[clusterID] = cluster centroid point
For each example point tuple, run:
map(): (dist = distance between point and its cluster centroid point, example point (copy), clusterID(copy))
map(): find closest cluster centroid point to the example point
map(): emit SSI = (distance - minClusterDistance)/minClusterDistance
reduce(): By clusterID emit (clusterID, cluster’s sum of SSI / #points)

我可能最终会成为实施者。太疯狂了，以前从来没有人写过这么快的。人们已经在我的预料中做到了，但出于竞争目的（企业利润、Kaggle 排名等），他们将其保留给自己。

以上格式为代码但不是代码。它是英文大纲或伪代码。 Whosebug 迫使我将此部分格式化为接受它的代码。

Answer 3

ClusteringEvaluator 从 Spark 2.3.0 开始可用，它计算 Silhouette 分数。

Answer 4

我无法添加评论，但我想强调郭玉林的回答：

ClusteringEvaluator is available since Spark 2.3.0, which compute Silhouette score.

这是在 SPARK-14516 中引入的可扩展、高效 实现。

在 Spark 中使用轮廓聚类

Using Silhouette Clustering in Spark

cluster-analysis

machine-learning

distributed-computing

k-means

apache-spark