How to specify "positive class" in sparkml classification?

How do you specify the "positive class" in Spark ML (binary) classification? (Or perhaps: how does MulticlassClassificationEvaluator determine which class is the "positive" one?)

Suppose we are training a model to target precision in a binary classification problem, e.g. ...

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import (IndexToString, OneHotEncoder,
                                StringIndexer, VectorAssembler)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

label_idxer = StringIndexer(inputCol="response",
                            outputCol="label").fit(df_spark)
# we fit so we can get the "labels" attribute to inform reconversion stage

feature_idxer = StringIndexer(inputCols=cat_features,
                              outputCols=[f"{f}_IDX" for f in cat_features],
                              handleInvalid="keep")

onehotencoder = OneHotEncoder(inputCols=feature_idxer.getOutputCols(),
                              outputCols=[f"{f}_OHE" for f in feature_idxer.getOutputCols()])

assembler = VectorAssembler(inputCols=(num_features + onehotencoder.getOutputCols()),
                            outputCol="features")

rf = RandomForestClassifier(labelCol=label_idxer.getOutputCol(),
                            featuresCol=assembler.getOutputCol(),
                            seed=123456789)

label_converter = IndexToString(inputCol=rf.getPredictionCol(),
                                outputCol="prediction_label",
                                labels=label_idxer.labels)

pipeline = Pipeline(stages=[label_idxer, feature_idxer, onehotencoder,
                            assembler,
                            rf,
                            label_converter])  # fitting this Pipeline (an Estimator) yields a pyspark.ml.pipeline.PipelineModel

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=ParamGridBuilder().build(),  # required by CrossValidator; empty grid keeps defaults
                          evaluator=MulticlassClassificationEvaluator(
                              labelCol=rf.getLabelCol(),
                              predictionCol=rf.getPredictionCol(),
                              metricName="weightedPrecision"),
                          numFolds=3)

(train_u, test_u) = df_spark.randomSplit([0.8, 0.2])
model = crossval.fit(train_u)

I know that...

Precision = TP / (TP + FP)

...but how do you designate a specific class label as the "positive class" when targeting precision? (As things stand, I don't actually know which response value is used as the positive class in training, nor how to tell.)
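To make concrete what the choice of positive class does to precision, here is a plain-Python sketch (not a Spark API; the helper name is made up) computing precision with a chosen label treated as positive:

```python
def precision_for_label(y_true, y_pred, positive):
    """Precision = TP / (TP + FP), treating `positive` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
p1 = precision_for_label(y_true, y_pred, positive=1)  # TP=2, FP=1 -> 2/3
p0 = precision_for_label(y_true, y_pred, positive=0)  # TP=1, FP=1 -> 1/2
```

The same predictions give different precision depending on which label you call positive, which is why the question matters.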

From a discussion on the Spark mailing list...

The positive class is "1" and negative is "0" by convention; I don't think you can change that (though you can translate your data if needed). F1 is defined only in a one-vs-rest sense in multi-class evaluation. You can set 'metricLabel' to define which class is 'positive' in multiclass - everything else is 'negative'.

Note that this implies (absent setting metricLabel on the MulticlassClassificationEvaluator) that the StringIndexer (specifically its stringOrderType param, https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html?highlight=stringindexer#pyspark.ml.feature.StringIndexer.stringOrderType ) is where the user controls which class they mean as their positive/negative class. (Note that per the docs the default is frequencyDesc, and that under frequencyDesc/Asc, strings with equal frequencies are further sorted alphabetically. I.e., in the minority-positive-class case you will be fine; otherwise the label naming would need to follow the 0=neg, 1=pos convention.)
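To illustrate that ordering rule, here is a plain-Python mimic (an illustration, not Spark's actual implementation) of how the default `stringOrderType="frequencyDesc"` assigns indices:

```python
from collections import Counter

def frequency_desc_indices(values):
    """Mimic StringIndexer's default ordering: most frequent label gets
    index 0; frequency ties are broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {label: float(i) for i, label in enumerate(ordered)}

# Minority positive class: "yes" is rarer, so it lands on index 1 (the
# conventional "positive" slot) with no extra work.
idx = frequency_desc_indices(["no", "no", "no", "yes"])  # {'no': 0.0, 'yes': 1.0}
```

If the positive class were the *majority* class, it would land on index 0 instead, which is when you would need to intervene (e.g. via stringOrderType or by renaming labels).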

In multi-class, there is no 'positive' class, they're all just classes. It defaults to 0 there but 0 doesn't have any particular meaning. You could apply this to a binary class setup. In that case, you could simply ask for F1 for label 0, and that would compute F1 for '0-vs-rest', and that would be like treating 0 as the 'positive' class for purposes of F1.
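The "F1 for label 0 as 0-vs-rest" idea can be sketched in plain Python (again an illustration, not a Spark API):

```python
def f1_for_label(y_true, y_pred, label):
    """One-vs-rest F1: treat `label` as positive, everything else as negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
f1_0 = f1_for_label(y_true, y_pred, label=0)  # 0-vs-rest
f1_1 = f1_for_label(y_true, y_pred, label=1)  # 1-vs-rest
```

Asking for F1 with `label=0` is exactly "treating 0 as the positive class", which is what `metricLabel=0.0` with `metricName="fMeasureByLabel"` expresses in the multiclass evaluator.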

One problem with this interpretation is that BinaryClassificationEvaluator does not appear to be able to evaluate Fbeta, Recall, Precision, etc. (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.BinaryClassificationEvaluator.html?highlight=binaryclassificationevaluator#pyspark.ml.evaluation.BinaryClassificationEvaluator.metricName), whereas MulticlassClassificationEvaluator does (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.MulticlassClassificationEvaluator.html?highlight=classificationevaluator#pyspark.ml.evaluation.MulticlassClassificationEvaluator.metricName). This means a user would need to switch between the two evaluators if they wanted to try training models targeting both areaUnderROC and F1, which in the binary classification case means switching the positive class's index value from 1 (since, as you say, 1 is the conventional positive class in binary classification) to 0 (since the docs say the multiclass evaluator's default metricLabel is 0).