Why is pyspark's areaUnderROC different from sklearn's roc_auc_score?
I'm running a binary classification with pyspark and I'm using BinaryClassificationEvaluator to evaluate the predictions made on the test set. Why do I get a different result if I use sklearn's roc_auc_score? For example:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier
from sklearn.metrics import roc_auc_score

trainDF, testDF = df.randomSplit([.8, .2], seed=42)

evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="prediction", metricName="areaUnderROC")

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
rfModel = rf.fit(trainDF)
prediction = rfModel.transform(testDF)

# The prediction DataFrame now has the columns 'label', 'prediction', 'probability':
# +-----+----------+---------------------+
# |label|prediction|probability          |
# +-----+----------+---------------------+
# |0    |0.0       |[1.0,0.0]            |
# |0    |0.0       |[0.9765625,0.0234375]|
# |0    |0.0       |[0.9765625,0.0234375]|
# +-----+----------+---------------------+

areaUnderROC = evaluator.evaluate(prediction)  # it returns 0.954459

# Now with pandas and sklearn:
RF_pred = prediction.select('label', 'prediction', 'probability').toPandas()
probRF = []
for i in range(len(RF_pred)):
    probRF.append(RF_pred['probability'][i][1])  # take only the probability of label 1
auc = roc_auc_score(RF_pred['label'], probRF)  # it returns 0.9962
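To see where the gap comes from, here is a minimal sketch (an illustration only, reusing the RF_pred DataFrame from above): feeding the hard 0/1 prediction column to roc_auc_score leaves sklearn a single effective threshold, so the AUC collapses to (TPR + TNR) / 2, which is essentially what the evaluator computed above with rawPredictionCol="prediction".

# Sketch: AUC from the hard 0/1 predictions instead of the probabilities.
# With a binary score the ROC curve has only one interior point, so the
# resulting value should land near the evaluator's 0.954459, not 0.9962.
auc_hard = roc_auc_score(RF_pred['label'], RF_pred['prediction'])
print(auc_hard)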
How is that possible?
I should have used the probability column (or just kept the default rawPrediction)! That way I get the same value as roc_auc_score. Shouldn't I have gotten an error for passing the wrong column? Apparently not: prediction is a perfectly valid numeric column, so the evaluator happily builds the ROC curve from the hard 0.0/1.0 predictions, with only a single effective threshold instead of the full ranking the probabilities would provide. The corrected evaluator (verified in the sketch after it):
evaluator = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="probability",
    metricName="areaUnderROC")
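As a sanity check, a small sketch (assuming the prediction DataFrame, RF_pred, and probRF from the question are still in scope): with a continuous score the evaluator and sklearn should now agree, up to Spark's internal approximation of the curve.

# Sketch: with rawPredictionCol="probability" the evaluator ranks rows by the
# class-1 probability, matching roc_auc_score on the same scores.
print(evaluator.evaluate(prediction))           # ~0.9962
print(roc_auc_score(RF_pred['label'], probRF))  # ~0.9962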
I hope this helps, because there isn't much about BinaryClassificationEvaluator even in the official documentation; the examples there only use MulticlassClassificationEvaluator.