How do I view the datapoints that are added to a cluster after applying K-Means algorithm?
I have implemented the k-means algorithm in Scala as follows.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.rdd.RDD

def clustering(clustnum: Int, iternum: Int, parsedData: RDD[org.apache.spark.mllib.linalg.Vector]): Unit = {
  // Train a k-means model with the given number of clusters and iterations
  val clusters = KMeans.train(parsedData, clustnum, iternum)
  println("The cluster centers of each column for " + clustnum + " clusters and " + iternum + " iterations are:")
  clusters.clusterCenters.foreach(println)
  // Predict the cluster index for every input vector
  val predictions = clusters.predict(parsedData)
  predictions.collect()
}
I know how to print the cluster centers of each cluster, but is there a function in Scala that prints which rows have been assigned to which cluster?
The data I am working with consists of rows of float values, each row with an ID. It has about 34 columns and about 200 rows. I am working with Spark in Scala.
I need to be able to see the result, e.g. that Id_1 is in cluster 1, and so on.
Edit: I was able to get this far:
println(clustnum + " clusters and " + iternum + " iterations")
val vectorsAndClusterIdx = parsedData.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}
vectorsAndClusterIdx.collect().foreach(println)
It prints each row that was assigned to a cluster together with its cluster ID. The row is shown as a string, with the cluster ID printed after it:
([1.0,1998.0,1.0,1.0,1.0,1.0,14305.0,39567.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],4)
([2.0,1998.0,1.0,1.0,2.0,1.0,185.0,2514.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
([3.0,1998.0,1.0,1.0,2.0,2.0,27.0,272.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
But is there any way to print only the row ID and the cluster ID?
Would using DataFrames help me here?
You can use the predict() function of KMeansModel.
In your code:
KMeans.train(parsedData, clustnum, iternum)
returns a KMeansModel object. So you can do:
val predictions = clusters.predict(parsedData)
and get a MapPartitionsRDD as the result.
predictions.collect()
gives you an Array of the cluster index assignments.
println(clustnum + " clusters and " + iternum + " iterations")
val vectorsAndClusterIdx = parsedData.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}
vectorsAndClusterIdx.collect().foreach(println)
seems to solve my problem. It prints each row that was assigned to a cluster together with its cluster ID. The row is shown as a string, with the cluster ID printed after it:
([1.0,1998.0,1.0,1.0,1.0,1.0,14305.0,39567.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],4)
([2.0,1998.0,1.0,1.0,2.0,1.0,185.0,2514.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
([3.0,1998.0,1.0,1.0,2.0,2.0,27.0,272.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
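To get only the row ID and the cluster ID, as asked in the follow-up, one option is to map each point to (point(0), clusters.predict(point)) instead of (point.toString, prediction), under the assumption that column 0 of each vector is the row ID (which matches the sample output above, where rows start with 1.0, 2.0, 3.0). The sketch below illustrates the pattern in plain Scala, without a Spark cluster; the `predict` helper is a hypothetical stand-in for KMeansModel.predict, using the same nearest-centroid rule.

```scala
// Plain-Scala sketch of "print only (rowId, clusterId)".
// Assumptions (not from the original post): element 0 of each row is the ID,
// and `predict` picks the nearest centroid by squared Euclidean distance,
// which is the rule KMeansModel.predict uses.
object IdAndCluster {
  def squaredDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Stand-in for clusters.predict(point): index of the nearest centroid
  def predict(centers: Array[Array[Double]], point: Array[Double]): Int =
    centers.indices.minBy(i => squaredDist(centers(i), point))

  def main(args: Array[String]): Unit = {
    val centers = Array(Array(0.0, 0.0), Array(10.0, 10.0))
    // Each row mirrors parsedData: element 0 is the ID, the rest are features.
    val rows = Array(
      Array(1.0, 1.0, 0.5),
      Array(2.0, 9.5, 10.5),
      Array(3.0, 0.2, 0.1)
    )
    // Keep only (rowId, clusterId) instead of the full vector string.
    val idsAndClusters = rows.map(r => (r(0).toLong, predict(centers, r.drop(1))))
    idsAndClusters.foreach { case (id, c) => println(s"Id_$id -> cluster $c") }
  }
}
```

On the real RDD the equivalent would be parsedData.map(point => (point(0).toLong, clusters.predict(point))), since the mllib Vector type supports indexing with point(i).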