Calculate cost of clustering in pyspark data frame

Question

I have a DataFrame with a million records, and I have used pyspark ml KMeans to identify clusters. Now I want to find the Within Set Sum of Squared Errors (WSSSE) for the number of clusters I used.

My Spark version is 1.6.0, and computeCost is not available in pyspark ml until Spark 2.0.0, so I have to compute it myself.

I have used the following method to find the squared error, but it takes a long time to produce output. I am looking for a better way to find the WSSSE.

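import math
from pyspark.sql.functions import col

# Keep only the feature vector and the assigned cluster id
check_error_rdd = clustered_train_df.select(col("C5"),col("prediction"))

# Cluster centers from the KMeans stage of the fitted pipeline
c_center = cluster_model.stages[6].clusterCenters()
check_error_rdd = check_error_rdd.rdd
print math.sqrt(check_error_rdd.map(lambda row:(row.C5- c_center[row.prediction])**2).reduce(lambda x,y: x+y) )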

clustered_train_df is my original training data after fitting the ML pipeline, and C5 is the featuresCol in KMeans.

check_error_rdd looks like this:

check_error_rdd.take(2)
Out[13]: 
[Row(C5=SparseVector(18046, {2398: 1.0, 17923: 1.0, 18041: 1.0, 18045: 0.19}), prediction=0),
 Row(C5=SparseVector(18046, {1699: 1.0, 17923: 1.0, 18024: 1.0, 18045: 0.91}), prediction=0)]

c_center is the list of cluster centers, where each center is a list of length 18046:

print len(c_center[1]) 
18046

Answer 1

I have computed this cost myself.

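As for the "slowness" you mention: for 100m points and 8192 centroids, it took me 50 minutes to compute the cost, with 64 executors and 202092 partitions, 8G of memory and 6 cores per machine, in client mode.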

Quoting the reference:

computeCost(rdd)

Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.

Parameters: rdd – The RDD of points to compute the cost on.

New in version 1.4.0.
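
To make the quoted definition concrete, here is a minimal plain-Python sketch of the same quantity on toy data. The function names and the sample points are made up for illustration and are not part of Spark. Note that the definition is a plain sum of squared distances, with no square root applied at the end.

```python
def squared_distance(point, center):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((p - c) ** 2 for p, c in zip(point, center))


def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Toy data (hypothetical): two centers and three points.
centers = [[0.0, 0.0], [10.0, 10.0]]
points = [[1.0, 0.0], [0.0, 2.0], [10.0, 11.0]]

# Nearest-center squared distances are 1, 4 and 1.
cost = kmeans_cost(points, centers)
```

On real data the nearest center is the one KMeans already assigned, so the prediction column can be used instead of the inner min.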

In case you cannot use this because you have a DataFrame, read:
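
The conversion can probably be done in a few lines. The following is a rough sketch, assuming Spark 1.6, where pyspark ml and pyspark mllib share the same vector classes, so the feature column can be handed to the RDD-based model directly. The helper name wssse_via_mllib and its parameters are hypothetical; KMeansModel and computeCost are the mllib API quoted above.

```python
def wssse_via_mllib(clustered_df, centers, features_col="C5"):
    """Compute the k-means cost of a DataFrame through the RDD-based API.

    clustered_df: DataFrame holding the feature vectors (assumed layout).
    centers:      list of cluster centers, e.g. from clusterCenters().
    features_col: name of the vector column (assumption: "C5" as in the question).
    """
    # Imported here so that defining the helper does not require Spark.
    from pyspark.mllib.clustering import KMeansModel

    # DataFrame -> RDD of Rows -> RDD of feature vectors.
    points = clustered_df.select(features_col).rdd.map(lambda row: row[0])

    # Wrap the already fitted centers in an RDD-based model; nothing is re-trained.
    model = KMeansModel(centers)

    # Sum of squared distances of the points to their nearest center.
    return model.computeCost(points)
```

With the names from the question, the call would be wssse_via_mllib(clustered_train_df, cluster_model.stages[6].clusterCenters()). On Spark 2.0 and later the ml and mllib vector types differ, so this sketch would need a conversion step there.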

As for your approach, I don't see anything wrong with it at first glance.
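
One thing that might be worth trying, offered as an untested guess rather than a measured result: each row has only a handful of non-zero entries out of 18046, while each center is dense, so subtracting the center touches every dimension. Expanding the squared distance as ||c||^2 - 2 x.c + ||x||^2 lets the per-row work depend only on the non-zero entries, with ||c||^2 computed once per center. The sketch below uses plain dicts as stand-ins for sparse vectors; all names and data are hypothetical.

```python
def center_sq_norms(centers):
    """Squared norm of every dense center, computed once up front."""
    return [sum(v * v for v in c) for c in centers]


def sparse_sq_distance(nonzeros, center, center_sq_norm):
    """Squared distance between a sparse point and a dense center.

    nonzeros maps index -> value for the point's non-zero entries.
    Uses ||x - c||^2 = ||c||^2 - 2 * x.c + ||x||^2.
    """
    dot = 0.0
    x_sq = 0.0
    for i, v in nonzeros.items():
        dot += v * center[i]
        x_sq += v * v
    return center_sq_norm - 2.0 * dot + x_sq


# Toy data (hypothetical): two dense centers, two (sparse point, prediction) rows.
centers = [[0.0, 1.0, 0.0, 2.0], [1.0, 0.0, 0.0, 0.0]]
norms = center_sq_norms(centers)
rows = [({1: 1.0, 3: 0.5}, 0), ({0: 1.0}, 1)]

# Within-set sum of squared errors over the assigned centers.
total = sum(sparse_sq_distance(x, centers[k], norms[k]) for x, k in rows)
```

With a pyspark SparseVector, the non-zero entries are available through its indices and values attributes. The expanded form can lose some precision when a point lies very close to its center, so it is a trade-off rather than a drop-in replacement.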


distributed-computing

k-means

dataframe

apache-spark

pyspark