Bisecting K-Means in Spark ML: what is the division rule?

I have started using Bisecting K-Means clustering in PySpark, and I am wondering what the division rule is during clustering.

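I know that K-Means is used to perform each split, but how is the next cluster to divide chosen? I have seen that there are several approaches (e.g., splitting the largest cluster, or splitting the cluster with the lowest internal similarity), but I cannot find which division rule is implemented in Spark ML.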

Thanks for your help.

According to the PySpark ML documentation (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans), the Bisecting KMeans algorithm is based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar (https://www.cs.cmu.edu/~dunja/KDDpapers/Steinbach_IR.pdf).

From Section 3 of the paper:

We found little difference between the possible methods for selecting a cluster to split and chose to split the largest remaining cluster.

The PySpark implementation modifies this approach. According to the PySpark documentation:

The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result more than k leaf clusters, larger clusters get higher priority.
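
To make the last sentence of that description concrete, here is a minimal pure-Python sketch of the per-level selection step as I read it from the documentation. It is not Spark's actual code: the function name `choose_clusters_to_split` and its parameters are hypothetical, and "divisible" is simplified here to a plain size threshold, which is an assumption on my part.

```python
from typing import Dict, List


def choose_clusters_to_split(
    leaf_sizes: Dict[int, int],
    k: int,
    min_divisible_size: int = 2,
) -> List[int]:
    """Pick which leaf clusters to bisect on the current level.

    leaf_sizes maps a (hypothetical) cluster id to its number of points.
    Bisecting one leaf replaces it with two, so each split adds exactly
    one leaf to the total.
    """
    # Assumption: a cluster is "divisible" if it has enough points.
    divisible = [
        cid for cid, size in leaf_sizes.items() if size >= min_divisible_size
    ]

    # How many more leaves we may still create before reaching k.
    remaining = k - len(leaf_sizes)
    if remaining <= 0:
        return []

    # Larger clusters get higher priority; ties are broken by cluster id
    # only to keep this sketch deterministic.
    divisible.sort(key=lambda cid: (-leaf_sizes[cid], cid))
    return divisible[:remaining]


# Four leaves and k=6: only two more splits fit, so the two largest win.
print(choose_clusters_to_split({2: 500, 3: 300, 4: 150, 5: 50}, k=6))  # [2, 3]
```

So, under this reading, cluster size only acts as a tie-breaker on the final level: on earlier levels every divisible leaf is bisected, and the "largest first" rule applies only when splitting all of them would overshoot k.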