Why is pyspark's areaUnderROC different from sklearn's roc_auc_score?
I'm running a binary classification with pyspark and I'm using BinaryClassificationEvaluator to evaluate the predictions made on the test set. Why do I get a different result if I use sklearn's roc_auc_score? For example:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier
from sklearn.metrics import roc_auc_score

trainDF, testDF = df.randomSplit([.8, .2], seed=42)

evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="prediction", metricName="areaUnderROC")

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
rfModel = rf.fit(trainDF)
prediction = rfModel.transform(testDF)

# The prediction DataFrame now has the columns 'label', 'prediction', 'probability':
# +-----+----------+---------------------+
# |label|prediction|probability          |
# +-----+----------+---------------------+
# |0    |0.0       |[1.0,0.0]            |
# |0    |0.0       |[0.9765625,0.0234375]|
# |0    |0.0       |[0.9765625,0.0234375]|
# +-----+----------+---------------------+

areaUnderROC = evaluator.evaluate(prediction)  # it returns 0.954459

# Now with pandas and sklearn:
RF_pred = prediction.select('label', 'prediction', 'probability').toPandas()
probRF = []
for i in range(len(RF_pred)):
    probRF.append(RF_pred['probability'][i][1])  # take only the probability of label 1
auc = roc_auc_score(RF_pred['label'], probRF)  # it returns 0.9962
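To see where the gap comes from, here is a minimal sketch (an illustration only, reusing the RF_pred DataFrame from above): feeding the hard 0/1 prediction column to roc_auc_score leaves sklearn a single effective threshold, so the AUC collapses to (TPR + TNR) / 2, which is essentially what the evaluator computed above with rawPredictionCol="prediction".

# Sketch: AUC from the hard 0/1 predictions instead of the probabilities.
# With a binary score the ROC curve has only one interior point, so the
# resulting value should land near the evaluator's 0.954459, not 0.9962.
auc_hard = roc_auc_score(RF_pred['label'], RF_pred['prediction'])
print(auc_hard)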
How is that possible?
I should have used the probability column (or just kept the default rawPrediction)! That way I get the same value as roc_auc_score. Shouldn't I have gotten an error for passing the wrong column? Apparently not: prediction is a perfectly valid numeric column, so the evaluator happily builds the ROC curve from the hard 0.0/1.0 predictions, with only a single effective threshold instead of the full ranking the probabilities would provide. The corrected evaluator (verified in the sketch after it):
evaluator = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="probability",
    metricName="areaUnderROC")
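As a sanity check, a small sketch (assuming the prediction DataFrame, RF_pred, and probRF from the question are still in scope): with a continuous score the evaluator and sklearn should now agree, up to Spark's internal approximation of the curve.

# Sketch: with rawPredictionCol="probability" the evaluator ranks rows by the
# class-1 probability, matching roc_auc_score on the same scores.
print(evaluator.evaluate(prediction))           # ~0.9962
print(roc_auc_score(RF_pred['label'], probRF))  # ~0.9962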
I hope this helps, because there isn't much about BinaryClassificationEvaluator even in the official documentation; the examples there only use MulticlassClassificationEvaluator.