How do I view the datapoints that are added to a cluster after applying K-Means algorithm?
I have implemented the k-means algorithm in Scala as follows.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.rdd.RDD

def clustering(clustnum: Int, iternum: Int, parsedData: RDD[org.apache.spark.mllib.linalg.Vector]): Unit = {
  // Train a k-means model with the given number of clusters and iterations
  val clusters = KMeans.train(parsedData, clustnum, iternum)
  println("The cluster centers of each column for " + clustnum + " clusters and " + iternum + " iterations are:")
  clusters.clusterCenters.foreach(println)
  // Predict the cluster index for every input vector
  val predictions = clusters.predict(parsedData)
  predictions.collect()
}
I know how to print the cluster centers of each cluster, but is there a function in Scala that prints which rows have been assigned to which cluster?
The data I am working with consists of rows of float values, each row with an ID. It has about 34 columns and about 200 rows. I am working with Spark in Scala.
I need to be able to see the result, e.g. that Id_1 is in cluster 1, and so on.
Edit: I was able to get this far:
println(clustnum + " clusters and " + iternum + " iterations")
val vectorsAndClusterIdx = parsedData.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}
vectorsAndClusterIdx.collect().foreach(println)
It prints each row that was assigned to a cluster together with its cluster ID. The row is shown as a string, with the cluster ID printed after it:
([1.0,1998.0,1.0,1.0,1.0,1.0,14305.0,39567.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],4)
([2.0,1998.0,1.0,1.0,2.0,1.0,185.0,2514.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
([3.0,1998.0,1.0,1.0,2.0,2.0,27.0,272.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
But is there any way to print only the row ID and the cluster ID?
Would using DataFrames help me here?
You can use the predict() function of KMeansModel.
In your code:
KMeans.train(parsedData, clustnum, iternum)
returns a KMeansModel object. So you can do:
val predictions = clusters.predict(parsedData)
and get a MapPartitionsRDD as the result.
predictions.collect()
gives you an Array of the cluster index assignments.
println(clustnum + " clusters and " + iternum + " iterations")
val vectorsAndClusterIdx = parsedData.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}
vectorsAndClusterIdx.collect().foreach(println)
seems to solve my problem. It prints each row that was assigned to a cluster together with its cluster ID. The row is shown as a string, with the cluster ID printed after it:
([1.0,1998.0,1.0,1.0,1.0,1.0,14305.0,39567.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],4)
([2.0,1998.0,1.0,1.0,2.0,1.0,185.0,2514.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
([3.0,1998.0,1.0,1.0,2.0,2.0,27.0,272.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
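To get only the row ID and the cluster ID, as asked in the follow-up, one option is to map each point to (point(0), clusters.predict(point)) instead of (point.toString, prediction), under the assumption that column 0 of each vector is the row ID (which matches the sample output above, where rows start with 1.0, 2.0, 3.0). The sketch below illustrates the pattern in plain Scala, without a Spark cluster; the `predict` helper is a hypothetical stand-in for KMeansModel.predict, using the same nearest-centroid rule.

```scala
// Plain-Scala sketch of "print only (rowId, clusterId)".
// Assumptions (not from the original post): element 0 of each row is the ID,
// and `predict` picks the nearest centroid by squared Euclidean distance,
// which is the rule KMeansModel.predict uses.
object IdAndCluster {
  def squaredDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Stand-in for clusters.predict(point): index of the nearest centroid
  def predict(centers: Array[Array[Double]], point: Array[Double]): Int =
    centers.indices.minBy(i => squaredDist(centers(i), point))

  def main(args: Array[String]): Unit = {
    val centers = Array(Array(0.0, 0.0), Array(10.0, 10.0))
    // Each row mirrors parsedData: element 0 is the ID, the rest are features.
    val rows = Array(
      Array(1.0, 1.0, 0.5),
      Array(2.0, 9.5, 10.5),
      Array(3.0, 0.2, 0.1)
    )
    // Keep only (rowId, clusterId) instead of the full vector string.
    val idsAndClusters = rows.map(r => (r(0).toLong, predict(centers, r.drop(1))))
    idsAndClusters.foreach { case (id, c) => println(s"Id_$id -> cluster $c") }
  }
}
```

On the real RDD the equivalent would be parsedData.map(point => (point(0).toLong, clusters.predict(point))), since the mllib Vector type supports indexing with point(i).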