'KMeansModel' object has no attribute 'computeCost' in Apache PySpark

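I am experimenting with clustering models in pyspark. I am trying to get the mean squared cost of the cluster fit for different values of K: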

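from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

def meanScore(k, df):
  inputCols = df.columns[:38]
  assembler = VectorAssembler(inputCols=inputCols, outputCol="features")
  kmeans = KMeans().setK(k)
  pipeModel2 = Pipeline(stages=[assembler, kmeans])
  kmeansModel = pipeModel2.fit(df).stages[-1]
  # This is the call that raises the AttributeError
  return kmeansModel.computeCost(assembler.transform(df)) / df.count()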

When I call this function to compute the cost for different values of K on my DataFrame:

for k in range(20,100,20):
  sc = meanScore(k,numericOnly)
  print((k,sc))

I get an attribute error: AttributeError: 'KMeansModel' object has no attribute 'computeCost'

I am new to pyspark and still learning, so I would sincerely appreciate any help. Thanks.

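computeCost is marked as deprecated in 3.0.0 in the docs, and the AttributeError means it is no longer available in the version you are running. The docs suggest using an evaluator instead: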

Note Deprecated in 3.0.0. It will be removed in future versions. 
Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.

As Erkan sirin mentioned, computeCost is deprecated in recent versions. This may help you solve the problem:

# Make predictions 
predictions = model.transform(dataset)
from pyspark.ml.evaluation import ClusteringEvaluator
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

Hope this helps. See the official documentation for more information.

Evaluate clustering by computing the Silhouette score:

In Spark 3.0.1 and later:

print('Silhouette with squared euclidean distance:')
pdt = model.transform(final_data)
from pyspark.ml.evaluation import ClusteringEvaluator
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(pdt)
print(silhouette)

Evaluate clustering by computing the Within Set Sum of Squared Errors (WSSSE):

Spark 2.2 to 3.0.0:

cost = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(cost))

The current version is 3.1.2.

Taking KMeans as an example, after importing:

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

load the data and train the model, then call 'ClusteringEvaluator()':

# Make predictions
predictions = model.transform(dataset)

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))