Why can't I set epsilon=1e-4 on the Spark KMeans algorithm?

I want to train a K-means model on Spark by setting epsilon=1e-4 instead of setting numIterations. In the spark shell, I typed:

val model = KMeans.train(trainRDD, numClusters=8, runs=30, initializationMode="k-means||",epsilon=1e-4)

But it fails with the following error:

scala> val model = KMeans.train(trainRDD, numClusters=8, runs=30, initializationMode="k-means||",epsilon=1e-4)
<console>:48: error: overloaded method value train with alternatives:
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int,runs: Int)org.apache.spark.mllib.clustering.KMeansModel <and>
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int)org.apache.spark.mllib.clustering.KMeansModel <and>
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int,runs: Int,initializationMode: String)org.apache.spark.mllib.clustering.KMeansModel <and>
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int,runs: Int,initializationMode: String,seed: Long)org.apache.spark.mllib.clustering.KMeansModel
 cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector], numClusters: Int, runs: Int, initializationMode: String, epsilon: Double)
       val model = KMeans.train(trainRDD, numClusters=8, runs=30, initializationMode="k-means||",epsilon=1e-4)
                          ^

What should I do?

There is no train method defined with that signature.

Use the actual constructor instead, and set its parameters as needed.

See the documentation: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans

Then use setEpsilon to set the early-termination threshold.
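
For example, a minimal sketch of that constructor-based approach (assuming the same trainRDD from the question and the older MLlib version shown in the error message, where runs is still supported):

    import org.apache.spark.mllib.clustering.KMeans

    // Build the estimator with the builder-style setters instead of KMeans.train,
    // which has no overload accepting an epsilon argument.
    val kmeans = new KMeans()
      .setK(8)                             // number of clusters
      .setInitializationMode("k-means||")  // same init mode as before
      .setRuns(30)                         // note: deprecated/removed in newer Spark releases
      .setMaxIterations(100)               // assumed upper bound; epsilon usually stops earlier
      .setEpsilon(1e-4)                    // convergence threshold for early termination

    // trainRDD: RDD[Vector], as in the question
    val model = kmeans.run(trainRDD)

The maxIterations value of 100 here is only an assumed upper bound; with epsilon set, the algorithm stops as soon as no cluster center moves more than 1e-4 between iterations.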